REMARKS

Reconsideration and allowance of the above-referenced application are respectfully 
requested. Claims 1, 7 and 10 have been amended, and claims 1-15 are pending in the 
application. 

Claims 1, 7 and 10 stand rejected under 35 U.S.C. 112, second paragraph. The Examiner
states that it is unclear what is meant by "size". The independent claims have been amended to
recite "a tag size as a prescribed number of bits, of an address field of a network". Support for
this amendment can be found at page 5, lines 24-31 and page 6, lines 11-20 of the specification.

It is submitted that all pending claims are in full compliance with 35 U.S.C. 112, second
paragraph. Therefore, the rejection should be withdrawn. 

Claims 3-6, 9, and 13-15 stand rejected under 35 U.S.C. 112, second paragraph as
containing a trademark since Applicant has not provided a copy of the InfiniBand™ Network
Protocol specification which pre-dates the application filing date. Per the Examiner's
requirement, Applicant submits, as Appendix A attached, a copy of "InfiniBand™ Architecture
Specification Volume 1 Release 1.0" dated October 24, 2000. Thus, it is submitted that the
scope of the claims can be further defined in view of the Specification and the rejection should
be withdrawn.

Claims 1-2, 7-8, 10-12 stand rejected under 35 USC 103(a) as being unpatentable over 
Benayoun in view of Fan. This rejection is respectfully traversed. 

Claim 1 recites configuring by the network manager each network switch of the network
to switch each of the data packets based on a corresponding switching tag, added to a start of the
corresponding data packet and the switching tag having the selected tag size of the address field,
without altering the content of the header. The Examiner has not addressed this limitation in the
rejection. As noted in Applicant's previous Remarks, the labels 18 of Benayoun do not have a
selected size as claimed. To the contrary, any routing headers added by the switch in Benayoun
have a fixed size. At column 3, lines 7-17, Benayoun merely teaches that a packet has a flow-id
field or other packet parameters available such as destination address, source address, port
number or protocol employed. Furthermore, Fan teaches that "the long addresses in the packet
header are replaced by the corresponding short addresses, and the address type (long or short) is
identified in the header" (column 6, lines 49-52, emphasis added); hence, "the packet with the
shortened header is then forwarded to the destination node within the virtual address using the
short address" (emphasis added, col. 6, lines 55-57). The address type (long or short) of Fan is
not a disclosure of the claimed tag having the selected tag size of the address field because the
tag is added to the start of the data packet without altering the content of the header. The claims
as amended define the selected tag size as a prescribed number of bits. In Fan, the 6 bytes of the
long address will always be replaced by a single byte (a fixed size, just like Benayoun) of the
short address (see column 6, lines 50-54 of Fan).

If Fan were combined with Benayoun in the manner suggested by the Examiner, the
routing header added by the switch in Benayoun would be replaced with a single byte short
address as taught by Fan, indicating address type. This does not teach or suggest the claimed
feature of configuring by the network manager each network switch of the network to switch
each of the data packets based on a corresponding switching tag, added to a start of the
corresponding data packet and the switching tag having the selected size of address fields,
without altering the content of the header. In fact, Fan teaches away from what is claimed.
Replacing a long address with a short address as in Fan alters the content of the header and
cannot be considered to be adding a tag to a start of a data packet without altering the content of
the header. Thus, the rejection is improper and should be withdrawn.

Claims 3-6, 9, 13-15 stand rejected under 35 U.S.C. 103(a) as being unpatentable over
Benayoun in view of Fan and further in view of Chui. These claims depend from independent
claims and are considered to be allowable for the reasons advanced above, and for the additional
reason that the added subject matter thereof is not taught or suggested by the prior art of record.

In view of the above, it is believed this application is in condition for allowance, and such
a Notice is respectfully solicited.

To the extent necessary, Applicant petitions for an extension of time under 37 C.F.R.
1.136. Please charge any shortage in fees due in connection with the filing of this paper,
including any missing or insufficient fees under 37 C.F.R. 1.17(a), to Deposit Account No.
50-0687, under Order No. 95-512, and please credit any excess fees to such deposit account.



Respectfully submitted,

Manelli Denison & Selter, PLLC

Edward J. Stemberger
Registration No. 36,017
Phone (202) 261-1014
Facsimile (202) 887-0336

Customer No. 20736

Date: December 21, 2007

1.2 InfiniBand Conceptual Overview

The InfiniBand™ Architecture Specification describes a first order
interconnect technology for interconnecting processor nodes and I/O nodes to
form a system area network. The architecture is independent of the host
operating system (OS) and processor platform.

Figure 1 IBA System Area Network
HCA = InfiniBand Channel Adapter in processor node
TCA = InfiniBand Channel Adapter in I/O node

1.2.1 The Problem

Existing interconnect technologies have failed to keep pace with computer
evolution and the increased burden imposed on data servers, application
processing, and enterprise computing created by the popular success of
the internet. High-end computing concepts such as clustering, fail-safe,
and 24x7 availability demand greater capacity to move data between
processing nodes as well as between a processor node and I/O devices.
These trends require higher bandwidth and lower latencies; they are
pushing more functionality down to the I/O device, and they are
demanding greater protection, higher isolation, deterministic behavior, and a
higher quality of service than currently available.

1.2.2 Features

InfiniBand™ Architecture (IBA) is designed around a point-to-point,
switched I/O fabric, whereby end node devices (which can range from
very inexpensive I/O devices like single chip SCSI or ethernet adapters to
very complex host computers) are interconnected by cascaded switch
devices. The physical properties of the IBA interconnect support two
predominant environments, with bandwidth, distance and cost optimizations
appropriate for these environments:

• Module-to-module, as typified by computer systems that support
I/O module add-in slots

• Chassis-to-chassis, as typified by interconnecting computers,
external storage systems, and external LAN/WAN access devices
(such as switches, hubs, and routers) in a data-center environment.

The IBA switched fabric provides a reliable transport mechanism where
messages are enqueued for delivery between end nodes. In general,
message content and meaning is not specified by InfiniBand Architecture,
but rather is left to the designers of end node devices and the processes
that are hosted on end node devices. IBA defines hardware transport
protocols sufficient to support both reliable messaging (send/receive) and
memory manipulation semantics (e.g. remote DMA) without software
intervention in the data movement path. IBA defines protection and error
detection mechanisms that permit IBA transactions to originate and
terminate from either privileged kernel mode (to support legacy I/O and
communication needs) or user space (to support emerging interprocess
communication demands).

The IBA Specification also addresses the need for a rich manageability
infrastructure to support interoperability between multiple generations of
IBA components from many vendors over time. This infrastructure
provides ease of use and consistent behavior for high volume, cost sensitive
deployment environments. IBA also specifies interfaces for industry
standard management that interoperate with enterprise class management
tools for configuration, asset management, error reporting, performance
metric collection, and topology management necessary for data center
deployment of IBA.

1.2.3 Benefits

For all of the revolutionary aspects of IBA, the architecture has been
carefully designed to minimize disruption of prevailing market paradigms and
business practices. By simultaneously supporting board and chassis
interconnections, it is expected that vendors are able to adopt InfiniBand
Architecture technology for use in future generations of existing products,
within current business practices, to best support their customers' needs.

IBA can support bandwidths that are anticipated to remain an order of
magnitude greater than prevailing I/O media (SCSI, Fibre Channel,
Ethernet). This ensures its role as the common interconnect for attaching
I/O media using these technologies. Reinforcing this point is IBA's native
use of IPv6 headers, which supports extremely efficient junctions between
IBA fabrics and traditional internet and intranet infrastructures.

IBA supports implementations as simple as a single computer system,
and can be expanded to include: replication of components for increased
system reliability, cascaded switched fabric components, additional I/O
units for scalable I/O capacity and performance, additional host node
computing elements for scalable computing, or any combinations thereof.
InfiniBand Architecture is a revolutionary architecture that enables
computer systems to keep up with the ever increasing customer requirement
for increased scalability, increased bandwidth, decreased CPU utilization,
high availability, high isolation, and support for Internet technology.

Being designed as a first order network, IBA focuses on moving data in
and out of a node's memory and is optimized for separate control and
memory interfaces. This permits hardware to be closely coupled or even
integrated with the node's memory complex, removing any performance
barriers. IBA is flexible enough to be implemented as a second order
network permitting legacy and migration. Even when implemented as a
second order network, IBA's memory optimization operation permits
maximum available bandwidth utilization and increases CPU efficiency.

1.3 Scope

IBA supports a range of applications from being the backplane interconnect
of a single host, to a complex system area network consisting of multiple
independent and clustered hosts and I/O components.

For the single host environments, as depicted in Figure 2, each IBA fabric
serves as a private I/O interconnect for its host and provides connectivity
between the host's CPU/memory complex and a number of I/O modules.
For this environment, all devices are dedicated to the host.



Figure 2 Single Host Environment

On the other end of the scale is multiple host connectivity as depicted in 
Figure 1. Here a single fabric or even multiple fabrics interconnect numerous
hosts and various I/O units. Some hosts might share I/O devices
and others do not. Interprocess communication between hosts becomes 
a very significant objective. Trivial fabric management is no longer suffi- 
cient as network administrators desire additional features to maintain sep- 
aration and assure deterministic behavior. 

The architecture not only specifies the mechanisms for I/O and interpro- 
cess communication, but it also specifies an extensive set of management 
mechanisms that are flexible enough to permit single host environments 
without undue burden and costly fabric managers and at the same time
support very complex system area networks (SAN) and feature rich fabric 
management. 



1.4 Document Organization 
1.4.1 Series of Volumes



There are 3 volumes that comprise the InfiniBand documentation suite as 
follows: 

Volume 1 - specifies the core InfiniBand™ Architecture. It provides nor- 
mative information required for IBA operation for switches, routers, host 
channel adapters for processor nodes, target channel adapters for I/O de- 
vices, and management. 

Volume 2 - specifies electrical & mechanical configurations. It specifies
requirements for a number of different physical media and signaling rates,
defines mechanical form factors, and specifies physical and chassis
management requirements.

Volume 3 - specifies policies, recommended practices, and examples for
applying IBA - such as booting over an InfiniBand fabric.

1.4.2 Volume 1 Organization

1.5 Audience

1.6 Document Conventions

1.6.1 Byte Ordering

This specification uses Big Endian byte ordering. For fields greater than
one byte in size this means that the most significant byte of each field is
transmitted first as illustrated in Figure 3.

Unless specifically stated otherwise, the text of this document lists fields
in the order of transmission. In most cases, multiple byte fields are aligned
to start or end on a 32-bit boundary. For clarity, certain figures show fields
ordered in 32-bit words. These words are in big endian format, and
implementations targeted for little endian processing need to pay particular
attention to byte ordering to assure correct operation, since little endian
processing tends to place the least significant bytes in lower byte offsets.

Figure 3 Byte Order for Multiple Byte Fields
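
The byte ordering rule can be illustrated with a minimal Python sketch. It is not part of the specification: the function name and the sample field values are hypothetical, and only the standard library struct module is used.

```python
import struct

def pack_fields_big_endian(field16: int, field32: int) -> bytes:
    """Pack a 16-bit and a 32-bit field with the most significant byte first."""
    # ">" selects big-endian order; H is an unsigned 16-bit field,
    # I is an unsigned 32-bit field.
    return struct.pack(">HI", field16, field32)

# Hypothetical sample values, chosen only to make the byte order visible.
wire = pack_fields_big_endian(0x1234, 0xAABBCCDD)

# A little-endian layout of the same 16-bit value places the least
# significant byte at the lower offset, which is the hazard the text warns of.
little = struct.pack("<H", 0x1234)
```

Here wire begins with byte 0x12 (the most significant byte of the first field), while little begins with 0x34.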
1.6.2 Numeric Values



Figure 4 illustrates how numeric and bit significant fields should be inter- 
preted. 

Figure 4 Byte Order Examples

Byte Offset | Byte 0,4,8,...   | Byte 1,5,9,...   | Byte 2,6,10,...  | Byte 3,7,11,...
bits        | b7 .......... b0 | b7 .......... b0 | b7 .......... b0 | b7 .......... b0
0-3         | b15 .. 16-bit field .. b0           | b15 .. 16-bit field .. b0
4-7         | b31 .. 32-bit field .. b0
8-11        | b7 . 1-byte . b0 | b23 .. 24-bit field .. b0
12-15       | b23 .. 24-bit field .. b0                              | b7 . 1-byte . b0
16-19       | b47 .. 48-bit field (high) .. b16
20-23       | b15 .. 48-bit field (low) .. b0     | b47 .. 48-bit field (high bytes) .. b32
24-27       | b31 .. 48-bit field (low bytes) .. b0
28-31       | b63 .. 64-bit field (high bytes) .. b32
32-35       | b31 .. 64-bit field (low bytes) .. b0
36-39       | b127 .. 128-bit field (highest bytes) .. b96
40-43       | b95 .. b64
44-47       | b63 .. b32
48-51       | b31 .. 128-bit field (lowest bytes) .. b0



Bit fields with other than byte granularity follow the same rules - that is, the
most significant bits of the field occupy the higher order bits of the lowest
byte offset with least significant bits being in the lowest byte offset as
illustrated in Figure 5.
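
One way to picture the rule for sub-byte fields is the following Python sketch. It is an illustration under stated assumptions, not text from the specification: the helper pack_bit_fields and the sample values are hypothetical, and the fields are assumed to be listed in transmission order with a total width that is a whole number of bytes.

```python
def pack_bit_fields(fields):
    """Pack (value, width) pairs, given in transmission order, into bytes.

    The first field lands in the highest order bits of the lowest byte
    offset; later fields follow toward the least significant bits.
    """
    accumulator = 0
    total_width = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        # Shift earlier fields up and append the new field below them.
        accumulator = (accumulator << width) | value
        total_width += width
    if total_width % 8 != 0:
        raise ValueError("total field width must be a whole number of bytes")
    return accumulator.to_bytes(total_width // 8, byteorder="big")

# A 5-bit field followed by a 3-bit field shares one byte.
one_byte = pack_bit_fields([(0b10101, 5), (0b011, 3)])

# A 4-bit field followed by a 12-bit field spans two bytes.
two_bytes = pack_bit_fields([(0xA, 4), (0x123, 12)])
```

In the first example the 5-bit field occupies bits b7 to b3 of the byte and the 3-bit field occupies bits b2 to b0.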

Figure 5 Bit Order Examples

5-bit field:  b4 | b3 | b2 | b1 | b0
3-bit field:  b2 | b1 | b0
2-bit field:  b1 | b0
6-bit field:  b5 | b4 | b3 | b2 | b1 | b0
4-bit field:  b3 | b2 | b1 | b0
12-bit field: b11 | b10 | b9 | b8 | b7 | b6 | b5 | b4 | b3 | b2 | b1 | b0
14-bit field: b13 | b12 | b11 | b10 | b9 | b8 | b7 | b6 | b5 | b4 | b3 | b2 | b1 | b0
2-bit field:  b1 | b0





Unless otherwise stated numerical values without qualifiers are decimal. 
This document uses the following qualifiers: 

• 0x prefixed to a hexadecimal value (e.g., 0x15F7)

• b' prefixed to a binary value (e.g., b'0110)

Obvious exceptions are binary numbers used in figures and tables.

In table headings a colon is used to specify a range of bits (e.g. Bits 7:0)
and table values in that column are binary numbers.

Global IDs are 128-bit values specified in the format:
value:value:value:value:value:value:value:value

Where each value represents a 4-digit hexadecimal number (e.g.,
FF02:0:0:0:0:0:0:1)
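
The Global ID notation can be sketched in Python as follows. This is an illustrative helper pair, not an interface defined by the specification: the names format_gid and parse_gid are hypothetical, and leading zeros are assumed to be omitted in each group, as in the example FF02:0:0:0:0:0:0:1.

```python
def format_gid(gid: int) -> str:
    """Render a 128-bit value as eight colon-separated hexadecimal groups."""
    if not 0 <= gid < (1 << 128):
        raise ValueError("a Global ID must fit in 128 bits")
    # Take 16 bits at a time, most significant group first.
    groups = [(gid >> shift) & 0xFFFF for shift in range(112, -1, -16)]
    return ":".join(format(group, "X") for group in groups)

def parse_gid(text: str) -> int:
    """Parse eight colon-separated hexadecimal groups into a 128-bit value."""
    parts = text.split(":")
    if len(parts) != 8:
        raise ValueError("expected exactly eight groups")
    value = 0
    for part in parts:
        group = int(part, 16)
        if not 0 <= group <= 0xFFFF:
            raise ValueError("each group must fit in 16 bits")
        value = (value << 16) | group
    return value

example = format_gid((0xFF02 << 112) | 1)
```

The two helpers are inverses of each other for any value in range, so parse_gid(format_gid(x)) returns x.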

Like any document, this specification is subject to errata for correctness,
clarity, and enhancements. The InfiniBand™ Trade Association hosts a
web site at http://www.InfiniBandTA.org. Please visit this site to check for
errata and updates to this specification.
Chapter 2: Glossary

Active
    Describes an entity initiating a communication establishment request
    (e.g., TCP CONNECT).

Address Handle
    An object that contains the information necessary to transmit messages
    to a remote port over Unreliable Datagram service.

Address Vector
    A collection of address and Path information specifying a remote port and
    the parameters to be used when communicating with it.

AETH
    Ack Extended Transport Header.

AM
    Attribute Modifier.

Asynchronous error
    A permanent error that cannot be reported through immediate or
    completion error handling mechanisms at the local end. Asynchronous errors
    may be unaffiliated or may be affiliated with a specific Completion
    Queue, End to End Context, or Queue Pair.

Attribute
    The collection of management data carried in a Management Datagram.

Automatic Path Migration
    The process in which a Channel Adapter, on a per-Queue Pair basis,
    signals another CA to cause Path Migration to a preset alternate Path.
    Automatic Path Migration uses a bit in a request or response packet (MigReq)
    to signal the other channel adapter to migrate to the predefined alternate
    path.

B_Key
    See Baseboard Management Key.

Base LID
    The numerically lowest Local Identifier that refers to a Port.

Baseboard Managed Unit
    Any Unit which provides InfiniBand™ specification defined information
    about itself by a Baseboard method MAD operation through the
    InfiniBand™ link.

Baseboard Management Key
    A construct that is contained in IBA management datagrams to
    authenticate that the sender is allowed to perform the requested operation.

Binding
    The act of associating a virtual address range in a specified Memory
    Region with a Memory Window.

BTH
    Base Transport Header.

Channel
    The association of two queue pairs for communication.

Channel Adapter
    Device that terminates a link and executes transport-level functions.
    One of Host Channel Adapter or Target Channel Adapter.

Channel Interface
    The presentation of the channel to the Verbs Consumer as implemented
    through the combination of the Host Channel Adapter, associated
    firmware, and device driver software.

Channel, Reliable Datagram
    See Reliable Datagram Channel.

CI
    See Channel Interface.

Client
    The active entity in an active/passive communication establishment
    exchange.

CM
    See Communication Manager.

CME
    Chassis Management Entity.

Communication Manager
    The software, hardware, or combination of the two that supports the
    communication management mechanisms and protocols.

Completion Error
    Permanent interface or processing error reported through completion
    status.

Completion Queue
    A queue containing one or more Completion Queue Entries. Completion
    Queues are internal to the Channel Interface, and are not visible to verb
    consumers.

Completion Queue Entry
    The Channel Interface-internal representation of a Work Completion.

Connection
    An association between a pair of entities (e.g., processes) over one or
    more channels.

Consumer
    See Verbs Consumer.

CQE
    Completion Queue Entry, commonly pronounced "cookie".

CRC
    Cyclic Redundancy Check.

Data Payload
    The data, not including any control or header information, carried in one
    packet.

Data Segment
    A tuple in a Work Request that specifies a virtually contiguous buffer for
    Host Channel Adapter access. Each Data Segment consists of a Virtual
    Address, an associated Local Key or Remote Key, and a length.

DETH
    Datagram Extended Transport Header.

[term illegible]
    Destination Globally Unique Identifier.

DLID
    Destination Local Identifier.

[term illegible]
    See End to End Context.

[term illegible]
    See End to End Context Number.

[term illegible]
    See End to End Context.

End to End Context
    The endpoint of a Reliable Datagram channel.

End to End Context Number
    Identifies a specific End to End Context within a Channel Adapter.

End to End Flow Control
    A mechanism to prevent a sender from transmitting messages during
    periods when receive buffers are not posted at the recipient.

Fabric
    The collection of Links, Switches, and Routers that connects a set of
    Channel Adapters.

Gb/s
    Giga-bits per second (10^9 bits per second)

GB/s
    Giga-bytes per second (10^9 bytes per second)

General Service Interface
    An interface providing management services (e.g., connection,
    performance, diagnostics) other than subnet management.

GID
    See Global Identifier.

Global Identifier
    A 128-bit identifier used to identify a port on a channel adapter, a port on
    a router, or a multicast group. GIDs are valid 128-bit IPv6 addresses (per
    RFC 2373) with additional properties / restrictions defined within IBA to
    facilitate efficient discovery, communication, and routing.

Global Route Header
    Routing header present in InfiniBand™ Architecture packets targeted to
    destinations outside the sender's local subnet.

Globally Unique Identifier
    A number that uniquely identifies a device or component.

[term illegible]
    General Management Packet.

GRH
    See Global Route Header.

GSI
    See General Service Interface.

GUID
    See Globally Unique Identifier.

HCA
    See Host Channel Adapter.

Host
    One or more Host Channel Adapters governed by a single memory/CPU
    complex.

Host Channel Adapter
    A Channel Adapter that supports the interface.

IBA
    InfiniBand™ Architecture.

IB-ML
    InfiniBand™ Management Link.

ICRC
    See Invariant CRC.

Immediate Data
    Data contained in a Work Queue Element that is sent along with the
    payload to the remote Channel Adapter and placed in a Receive Work
    Completion.

Immediate Error
    A permanent interface error reported through the verb status.

Initiator
    The source of requests.

Interface Error
    An error due to an invalid field in a Work Request.

Invalid Key
    See Key.

Invariant CRC
    A CRC covering the fields in a packet that do not change from the source
    to the destination.

I/O
    Input/Output.

I/O Controller
    One of the two architectural divisions of an I/O Unit. An I/O controller
    (IOC) provides I/O services, while a Target Channel Adapter provides
    transport services.

I/O Unit
    An I/O unit (IOU) provides I/O service(s). An I/O unit consists of one or
    more I/O Controllers attached to the fabric through a single Target
    Channel Adapter.

I/O Virtual Address
    An address that has no direct meaning to the Host processor and is
    intended for use only in describing a Local or Remote memory buffer to the
    Host Channel Adapter.

IOC
    See I/O Controller.

IOU
    See I/O Unit.

IPv6
    Internet Protocol, version 6

IPv6 Address
    A 128-bit address constructed in accordance with IETF RFC 2460 for
    IPv6.

Key
    A construct used to limit access to one or more resources, similar to a
    password. The following keys are defined by the InfiniBand™
    Architecture:

    Baseboard Management Key
    Local Key
    Management Key
    Queue Key
    Partition Key
    Remote Key

L_Key
    See Local Key.

LID
    See Local Identifier.

LID Mask Control
    A per-port value assigned by the Subnet Manager. The value of the LMC
    specifies the number of Path Bits in the Local Identifier.

Link
    A full duplex transmission path between any two network fabric
    elements, such as Channel Adapters or Switches.

LMC
    See LID Mask Control.

Local Identifier
    An address assigned to a port by the Subnet Manager, unique within the
    subnet, used for directing packets within the subnet. The Source and
    Destination LIDs are present in the Local Route Header. A Local
    Identifier is formed by the sum of the Base LID and the value of the Path Bits.

Local Key
    An opaque object, created by a verb, referring to a Memory Region, used
    with a Virtual Address to describe authorization for the HCA hardware to
    access local memory. It may also be used by the HCA hardware to
    identify the appropriate page tables for use in translating virtual to physical
    addresses.

Local Route Header
    Routing header present in all InfiniBand™ Architecture packets, used for
    routing through switches within a subnet.

Local Subnet
    The collection of links and Switches that connect the Channel Adapters
    of a particular subnet.

LRH
    See Local Route Header.
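
As an illustration of the Local Identifier, Base LID, and LID Mask Control entries in this glossary, the following Python sketch enumerates the LIDs that refer to one port. It is a minimal sketch rather than anything defined by the specification: the function names are hypothetical, and split_lid additionally assumes that the low LMC bits of the Base LID are zero, so that adding the Path Bits never carries into the upper bits.

```python
def lids_for_port(base_lid: int, lmc: int) -> list:
    """List every LID that refers to a port with the given Base LID and LMC.

    The LMC gives the number of Path Bits, so there are 2**lmc LIDs,
    each formed as Base LID + Path Bits.
    """
    if base_lid < 0 or lmc < 0:
        raise ValueError("base_lid and lmc must be non-negative")
    return [base_lid + path_bits for path_bits in range(1 << lmc)]

def split_lid(lid: int, lmc: int) -> tuple:
    """Split a LID into (Base LID, Path Bits).

    Assumption for this sketch: the Base LID has its low lmc bits clear.
    """
    mask = (1 << lmc) - 1
    return lid & ~mask, lid & mask

# Hypothetical port: Base LID 0x10 with an LMC of 2 (two Path Bits).
port_lids = lids_for_port(0x10, 2)
```

With Path Bits of zero the LID equals the Base LID, which matches the Path Bits entry.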

M_Key
    See Management Key.

MAD
    See Management Datagram.

Managed Unit
    A Unit which provides VPD information about itself to an external entity,
    and is managed by that entity.

Management Datagram
    Refers to the contents of an unreliable datagram packet used for
    communication among the HCAs, switches, routers, and TCAs to manage the
    network. InfiniBand™ Architecture describes the format of a number of
    these management commands.

Management Key
    A construct that is contained in IBA management datagrams to
    authenticate the sender to the receiver.

Maximum Transfer Unit
    See Path Maximum Transfer Unit.

MB/s
    Mega-bytes per second (10^6 bytes per second)

Memory Protection Attributes
    The access rights granted to Memory Regions.

Memory Region Handle
    An opaque object returned to the consumer when the consumer registers
    a Memory Region. The Memory Region Handle is used to specify the
    registered region to the memory management verbs.

Memory Region
    A virtually contiguous area of arbitrary size within a Consumer's address
    space that has been registered, enabling HCA local access and optional
    remote access.

Memory Registration
    The act of registering a host Memory Region for use by a consumer. The
    memory registration operation returns a Memory Region Handle. The
    process provides this with any reference to a virtual address within the
    memory region.

Memory Window
    An allocated resource that enables remote access after being bound to a
    specified area within an existing Memory Region. Each Memory Window
    has an associated Window Handle, set of access privileges, and current
    R_Key.

Message
    A transfer of information between two or more Channel Adapters that
    consists of one or more packets.

Message-Level Flow Control
    See End to End Flow Control.

Message Sequence Number
    A value returned as part of an acknowledgement by the responder to the
    requestor, indicating the last message completed. Contrast Packet
    Sequence Number.

Modifiers
    In a verb definition, the list of input and output objects that specify how,
    and on what, the verb is to be executed.

MSN
    See Message Sequence Number.

MTU
    Maximum Transfer Unit, see Path Maximum Transfer Unit.

Multicast
    A facility by which a packet sent to a single address may be delivered to
    multiple ports.

Multicast Identifier
    An identifier for a set of ports making up a Multicast Group, typically
    belonging to different Channel Adapters. On a subnet, Multicast
    Identifiers share the address space of Local Identifiers.

Multicast Group
    A collection of Channel Adapter ports that receive Multicast packets sent
    to a single address.

Node
    An overloaded term, used to refer to: a Channel Adapter, Switch, or
    Router: a

NQ
    Notification Queue.

Out-of-band Management
    Management messages which traverse a transport other than the
    InfiniBand™ fabric.

Outstanding
    1) The state of a Work Request after it has been posted on a Work
    Queue, but before the retrieval of the Work Completion by the
    consumer.
    2) The state of a packet that has been sent onto the fabric but has not
    been acknowledged.

P_Key
    See Partition Key.

Packet
    The indivisible unit of IBA data transfer and routing, consisting of one or
    more headers, a Packet Payload, and one or two CRCs.

Packet Payload
    The portion of a Packet between (not including) any Transport header(s)
    and the CRCs at the end of each packet. The packet payload contains up
    to 4096 bytes.

Packet Sequence Number
    A value carried in the IBA Base Transport Header that allows the detection
    and re-sending of lost packets.

Partition
    A collection of Channel Adapter ports that are allowed to communicate
    with one another. Ports may be members of multiple partitions
    simultaneously. Ports in different partitions are unaware of each other's
    presence insofar as possible.

Partition Key
    A value carried in packets and stored in Channel Adapters that is used to
    determine membership in a partition.

    Default Partition Key: A partition key special value providing Full
    membership in the default partition. See Partition Membership Type.

    Invalid Partition Key: A special value that indicates that the Partition
    Key Table entry does not contain a valid key.

Partition Key Table
    A table of partition keys present in each Port.

Partition Key Table Index (P_Key_Ix)
    An index into the partition key table.

Partition Manager
    The entity that manages partition keys and membership.

Partition Membership Type
    The high-order bit of the partition key is used to record the type of
    membership in a Port's partition table: 0 for Limited, 1 for Full. Limited
    members cannot accept information from other Limited members, but
    communication is allowed between every other combination of
    membership types.

Passive
    Describes an entity waiting to receive a communication establishment
    request (e.g., TCP LISTEN).

Path
    The collection of links, switches, and routers a message traverses from a
    source Channel Adapter to a destination channel adapter. Within a
    subnet, a path is defined by the tuple <SLID, DLID, SL>.

Path Bits
    The portion of a Local Identifier that may be changed to vary the Path
    through the subnet to a particular Port. If the Path Bits are zero, the Local
    Identifier is equal to the Base LID. The number of Path Bits applicable to
    a particular port is specified by the Subnet Manager through the LID Mask
    Control value.

Path Maximum Transfer Unit
    The maximum size of the Packet Payload supported along a Path from
    source to destination. PMTU is described in terms of the payload size,
    and may be 256, 512, 1024, 2048, or 4096 bytes.

Path Migration
    The modification of the Path used by a connection.

PD
    See Protection Domain.
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Peer 

Pinning memory 

PIVI 

PIVITU 

Port 



Post 

Private Data 
Processing Error 



Protection Domain 

PSN 
Q_Key 
QoS 
QP 

Quality of Service 
Queue Key 



Queue Pair 



1 ) One of the agents in an active/active connection establishment ex- 
change. 

2) A generic term for the entity at the other end of a connection. 

A function supplied by the OS which forces the memory region to be res- 
ident and keeps the vlrtual-to-physical translations constant from the 
HCA point of view. 

See Partition Manager . 

See Path Maximum Transfer Unit . 

Location on a Channel Adapter or Switch to which a link connects. There 
may be multiple ports on a single Channel Adapter , each with different 
context information that must be maintained. Switch es/switch elements 
contain more than one port by definition. 

To place a Work Request on a Work Queue . 

A field present in Communication Management messages that is opaque 
at all IBA layers. Consumers may use this field to "piggy-back" additional 
information over the CM message exchange. 

A processing error is an error that occurs when the Host Channel 
Adapter is performing the unit of work described by the Work Queue Ele- 
ment and is unable to complete the request successfully due to an error 
that is returned by the transport protocol. 

A mechanism for associating Queue Pair s. Address Handle s. Memory 
Window s, and Memorv Region s. 

See Packet Sequence Number . 

See Queue Key . 

See Quality of Service . 

See Queue Pair . 

Metrics that predict the behavior, reliability, speed, and latency of a given 
network connection. 

A construct that is used to validate a remote sender's right to access a 
local Receive Queue for the Unreliable Datagram and Reliable Datagram 
service types. If the Q_Key present in an incoming packet does not 
match the value stored in the receiving QP, the packet shall be dropped. 

    Consists of a Send Work Queue and a Receive Work Queue. Send and
    receive queues are always created as a pair and remain that way
    throughout their lifetime. A Queue Pair is identified by its Queue Pair
    Number.

Queue Pair Context
    The information that pertains to a particular Queue Pair, such as the
    current Work Queue Elements, Packet Sequence Numbers, transmission
    parameters, etc.

Queue Pair Handle
    An opaque object that refers to a specific Queue Pair. A Queue Pair
    Handle is returned by the operation that creates the QP and is supplied
    as an identifying parameter for other QP operations.

Queue Pair Number
    Identifies a specific Queue Pair within a Channel Adapter.

R_Key
    See Remote Key.

Raw Datagram
    A packet that contains an IBA Local Route Header, may contain an IBA
    Global Route Header, but does not contain an IBA Transport header, and
    is not handled by IBA transport services.

RC
    See Reliable Connection.

RD
    See Reliable Datagram.

RDC
    See Reliable Datagram Channel.

RDD
    See Reliable Datagram Domain.

RDETH
    Reliable Datagram Extended Transport Header.

RDMA
    See Remote Direct Memory Access.

Receive Queue
    One of the two queues associated with a Queue Pair. The receive queue
    contains Work Queue Elements that describe where to place incoming
    data.

Region Handle
    See Memory Region Handle.

Registered Memory
    A region of memory that has been through Memory Registration.

Registration
    See Memory Registration.

Registered memory region
    See Memory Region.

Reliable Connection
    A Transport Service Type in which a Queue Pair is associated with only
    one other QP, such that messages transmitted by the send queue of one
    QP are reliably delivered to receive queue of the other QP. As such,
    each QP is said to be "connected" to the opposite QP.

Reliable Datagram
    A Transport Service Type in which a Queue Pair may communicate with
    multiple other QPs over a Reliable Datagram Channel. A message
    transmitted by an RD QP's send queue will be reliably delivered to the
    receive queue of the QP specified in the associated Work Request.
    Despite the name, Reliable Datagram messages are not limited to a
    single packet.

Reliable Datagram Channel
    The association of two Reliable Datagram End to End Contexts. A
    Reliable Datagram channel may multiplex Reliable Datagrams from many
    RD Queue Pairs.

Reliable Datagram Domain
    An association that defines which Reliable Datagram Queue Pairs may
    use an End to End Context.

Remote Direct Memory Access
    Method of accessing memory on a remote system without interrupting
    the processing of the CPU(s) on that system.

Remote Key
    An opaque object, created by a verb, referring to a Memory Region or
    Memory Window, used with a Virtual Address to describe authorization
    for the remote device to access local memory. It may also be used by
    the HCA hardware to identify the appropriate page tables for use in
    translating virtual to physical addresses.

Retired
    The state of a Work Queue Element after the Host Channel Adapter
    completes the operation specified by the WQE, but before the Work
    Completion has been presented to the consumer.

RNR Nak
    Receiver Not Ready. A response signifying that the receiver is not
    currently able to accept the request, but may be able to do so in the
    future.

Router
    A device that transports packets between IBA subnets.

SA
    See Subnet Administration.

SAR
    Segmentation and Re-assembly.

Send Queue
    One of the two queues of a Queue Pair. The Send queue contains WQEs
    that describe the data to be transmitted.

Server
    1) The passive entity in a connection establishment exchange.

    2) An entity (e.g., a process) that provides services in response to
    requests from clients.

Service ID
    A value that allows a Communication Manager to associate an incoming
    connection request with the entity providing the service. The Service
    ID is similar to the TCP Port Number.

Service Level
    Value in the Local Route Header identifying the appropriate Virtual
    Lane for a packet, enabling the implementation of differentiated
    services. While the appropriate VL for a specific Service Level may
    differ over a packet's Path, the Service Level remains constant.

Service Type
    See Transport Service Type.

Signaled Completion
    A modifier used for Work Requests submitted to the Send Queue
    specifying that a Work Completion shall be generated when the work
    requested completes, whether successfully or in error.

SGID
    Source Global Identifier.

SLID
    Source Local Identifier.

SL
    See Service Level.

SM
    See Subnet Manager.

SMA
    See Subnet Management Agent.

SMP
    See Subnet Management Packet.

Solicited Event
    A facility by which a message sender may cause an event to be
    generated at the recipient when the message is received.

Subnet
    A set of InfiniBand™ Architecture Ports, and associated links, that
    have a common Subnet ID and are managed by a common Subnet Manager.
    Subnets may be connected to each other through routers.

Subnet Administration
    The architectural construct that implements the interface for querying
    and manipulating subnet management data.

Subnet Manager
    One of several entities involved in the configuration and control of
    the subnet.

    Master Subnet Manager: The subnet manager that is authoritative, that
    has the reference configuration information for the subnet.

    Standby Subnet Manager: A subnet manager that is currently quiescent,
    and not in the role of a master SM, by agency of the master SM.
    Standby SMs are dormant managers.

Subnet Management Agent
    An entity present in all IBA Channel Adapters and Switches that
    processes Subnet Management Packets from Subnet Manager(s).

Subnet Management Data
    Vital Product Data required by the Subnet Manager.






InfiniBand™ Trade Association



Page 51

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067



InfiniBand™ Architecture Release 1.0 
Volume 1 - General Specifications 



Glossary 



October 24, 2000 
FINAL 



Subnet Management Packet
    The subclass of Management Datagrams used to manage the subnet.
    SMPs travel exclusively over Virtual Lane 15 and are addressed
    exclusively to Queue Pair Number 0.

Switch
    A device that routes packets from one link to another of the same
    Subnet, using the Destination Local Identifier field in the Local Route
    Header.

TCA
    See Target Channel Adapter.

Target Channel Adapter
    A Channel Adapter typically used to support I/O devices. TCAs are not
    required to support the Verbs interface. See also I/O Unit.

Transport Service Type
    Describes the reliability, sequencing, message size, and operation
    types that will be used between the communicating Channel Adapters.

    Transport service types that use the IBA transport:

    • Reliable Connection
    • Unreliable Connection
    • Reliable Datagram
    • Unreliable Datagram

    Raw Datagram service does not use the IBA transport.

UC
    See Unreliable Connection.

UD
    See Unreliable Datagram.

Unicast
    An identifier for a single port. A packet sent to a unicast address is
    delivered to the port identified by that address.

Unit
    One or more sets of processes and/or functions attached to the fabric
    by one or more channel adapters. See Host and I/O Unit.

Unreliable Connection
    A Transport Service Type in which a Queue Pair is associated with only
    one other QP, such that messages transmitted by the send queue of one
    QP are, if delivered, delivered to the receive queue of the other QP.
    As such, each QP is said to be "connected" to the opposite QP. Messages
    with errors are not retried by the transport, and error handling must
    be provided by a higher level protocol.

Unreliable Datagram
    A Transport Service Type in which a Queue Pair may transmit and
    receive single-packet messages to/from any other QP. Ordering and
    delivery are not guaranteed, and delivered packets may be dropped by
    the receiver.

Unsignaled Completion
    A modifier used for Work Requests submitted to the Send Queue
    signifying that a Work Completion is to be generated only if the
    requested action completes in error.

Variant CRC
    A CRC covering all the fields of a packet, including those that may be
    changed by Switches.

VCRC
    See Variant CRC.

Verbs
    An abstract description of the functionality of a Host Channel Adapter.
    An operating system may expose some or all of the verb functionality
    through its programming interface.

Verbs Consumer
    The direct user of the Verbs.

Virtual Lane
    A method of providing independent data streams on the same physical
    link.

Vital Product Data
    Device-specific data to support management functions.

VL
    See Virtual Lane.

VPD
    See Vital Product Data.

WC
    See Work Completion.

Window Handle
    An opaque object that identifies a Memory Window.

Work Completion
    The consumer-visible representation of a Completion Queue Entry. A
    Work Completion may be obtained when a consumer polls a Completion
    Queue.

Work Queue
    One of Send Queue or Receive Queue.

Work Queue Element
    The Host Channel Adapter's internal representation of a Work Request.
    The consumer does not have direct access to Work Queue Elements.

Work Queue Pair
    See Queue Pair.

Work Request
    The means by which a consumer requests the creation of a Work Queue
    Element.

WQ
    See Work Queue.

WQE
    Work Queue Element, commonly pronounced "wookie".

WQP
    See Work Queue Pair.

WR
    See Work Request.

CHAPTER 3: Architectural Overview

This chapter provides a top-down description of the InfiniBand™
Architecture (IBA) features, capabilities, components, and elements and it
describes various principles of operation. It is a high level overview
intended as an informative guide and thus certain details are intentionally
excluded for the purpose of clarity.

IBA defines a System Area Network (SAN) for connecting multiple
independent processor platforms (i.e., host processor nodes), I/O
platforms, and I/O devices (see Figure 6). The IBA SAN is a communications
and management infrastructure supporting both I/O and interprocessor
communications (IPC) for one or more computer systems. An IBA system can
range from a small server with one processor and a few I/O devices to a
massively parallel supercomputer installation with hundreds of processors
and thousands of I/O devices. Furthermore, the internet protocol (IP)
friendly nature of IBA allows bridging to an internet, intranet, or
connection to remote computer systems.

IBA defines a switched communications fabric allowing many devices to
concurrently communicate with high bandwidth and low latency in a
protected, remotely managed environment. An endnode can communicate
over multiple IBA ports and can utilize multiple paths through the IBA
fabric. The multiplicity of IBA ports and paths through the network are
exploited for both fault tolerance and increased data transfer bandwidth.

IBA hardware off-loads from the CPU much of the I/O communications
operation. This allows multiple concurrent communications without the
traditional overhead associated with communicating protocols. The IBA SAN
provides its I/O and IPC clients zero processor-copy data transfers, with
no kernel involvement, and uses hardware to provide highly reliable, fault
tolerant communications.

Figure 6 IBA System Area Network
(HCA = InfiniBand Channel Adapter in processor node;
TCA = InfiniBand Channel Adapter in I/O node)

An IBA System Area Network consists of processor nodes and I/O units
connected through an IBA fabric made up of cascaded switches and
routers.

I/O units can range in complexity from single ASIC IBA attached devices
such as a SCSI or LAN adapter to large memory rich RAID subsystems
that rival a processor node in complexity.

3.1 Architecture Scope

This volume of the InfiniBand Architecture Specification defines the
interconnect fabric, routing elements, endnodes, management infrastructure,
and the communication formats and protocols. It does not specify I/O
commands or cluster services.

For example, consider an IBA SCSI adapter. IBA does not define the disk
I/O commands, how the SCSI adapter communicates with the disk, how
the operating system (OS) views the disk device, nor which node in the
cluster owns the disk adapter. IBA is an essential underpinning of each of
these operations, but does not directly define any of them. Instead, IBA
defines how data and commands can be transported between the I/O
driver on a processor node and the SCSI adapter.

IBA handles the data communications for I/O and IPC in a multi-computer
environment. It supports the high bandwidth and scalability required for
I/O. It caters to the extremely low latency and low CPU overhead required
for IPC. With IBA, the OS can provide its clients with communication
mechanisms that bypass the OS kernel and directly access IBA network
communication hardware, enabling efficient message passing operation.
IBA is well suited to the latest computing models and will be a building
block for new forms of I/O and cluster communication. IBA allows I/O units
to communicate among themselves and with any or all of the processor
nodes in a system. Thus an I/O unit has the same communications
capability as any processor node.

3.1.1 Topologies & Components 

At a high level, IBA serves as an interconnect for endnodes as illustrated 
in Figure 7. Each node can be a processor node, an I/O unit, and/or a 
router to another network. 



Figure 7 IBA Network



An IBA network is subdivided into subnets interconnected by routers as 
illustrated in Figure 8. Endnodes may attach to a single subnet or multiple 
subnets.

Figure 8 IBA Network Components

An IBA subnet is composed of endnodes, switches, routers, and subnet
managers interconnected by links as illustrated in Figure 9. Each IBT
device may attach to a single switch or multiple switches and/or directly
with each other¹. Multiple links can exist between any two IBT devices.

Figure 9 IBA Subnet Components

1. Single endnode to endnode connection creates an independent subnet, with
no connectivity to the remainder of the IBT devices, in which case one of the two
interconnected endnodes functions as the subnet manager for that link.

The architecture is optimized for units that contain multiple independent
processes and threads (consumers) as illustrated in Figure 10. Each
channel adapter constitutes a node on the fabric. The architecture
supports multiple channel adapters per unit with each channel adapter
providing one or more ports that connect to the fabric, in which case the
processor node appears as multiple endnodes to the fabric.

Figure 10 Processor Node
(Consumers, Message & Data Service, Verbs, Channel Adapters (endnodes)
and their Ports; the figure marks the scope of the InfiniBand Architecture)

In a processor node, the message and data service is an OS component
that is outside the scope of this document. This document specifies the
semantic interface between the message and data service and a channel
adapter. This semantic interface is referred to as IBA Verbs. Verbs
describe the functions necessary to configure, manage, and operate a host
channel adapter. These verbs identify the appropriate parameters that
need to be included for each particular function. Verbs are not an API, but
provide the framework for the OSV to specify the API.

IBA is architected as a first order network and as such it defines the host
behavior (verbs) and defines memory operation such that the channel
adapter can be located as close to the memory complex as possible. It
provides independent direct access between consenting consumers
regardless of whether those consumers are I/O drivers and I/O controllers
or software processes communicating on a peer to peer basis. IBA
provides both channel semantics (send and receive) and direct memory
access with a level of protection that prevents access by non participating
consumers.

3.2 Communication
3.2.1 Queuing

The foundation of IBA operation is the ability of a consumer to queue up
a set of instructions that the hardware executes. This facility is referred to
as a work queue. Work queues are always created in pairs, called a
Queue Pair (QP), one for send operations and one for receive operations.
In general, the send work queue holds instructions that cause data to be
transferred between the consumer's memory and another consumer's
memory, and the receive work queue holds instructions about where to
place data that is received from another consumer. The other consumer
is referred to as a remote consumer even though it might be located on
the same node. IBA specifically describes the queuing relationship for a
Host Channel Adapter (HCA) but not the I/O unit because an I/O unit is
not necessarily subject to 2nd and 3rd party interoperability that is present
in a host environment (i.e., interoperability between the HCA vendor, the
OS vendor, and an IHV's I/O driver or an ISV's application using IPC). The
following describes the HCA queuing model.

The consumer submits a work request (WR), which causes an instruction
called a Work Queue Element (WQE) to be placed on the appropriate
work queue. The channel adapter executes WQEs in the order that they
were placed on the work queue. When the channel adapter completes a
WQE, a Completion Queue Element (CQE) is placed on a completion
queue¹. Each CQE specifies all the information necessary for a work
completion, and either contains that information directly or points to other
structures, for example, the associated WQE, that contain the information.
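
The flow described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `WQE`, `CQE`, `WorkQueue`, `CompletionQueue`, and `wr_id` are hypothetical and are not defined by the architecture, and execution is modelled as strictly one WQE at a time.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WQE:
    """Hypothetical internal form of a work request."""
    wr_id: int
    opcode: str


@dataclass
class CQE:
    """Hypothetical completion record that points back at its WQE."""
    wqe: WQE
    status: str


class CompletionQueue:
    def __init__(self):
        self.entries = deque()

    def poll(self):
        """Return the oldest CQE, or None if the queue is empty."""
        return self.entries.popleft() if self.entries else None


class WorkQueue:
    def __init__(self, cq):
        self.cq = cq
        self.wqes = deque()

    def post(self, wr_id, opcode):
        """Submitting a work request places a WQE on the work queue."""
        self.wqes.append(WQE(wr_id, opcode))

    def execute_next(self):
        """Complete the oldest WQE and place a CQE on the completion queue."""
        wqe = self.wqes.popleft()
        self.cq.entries.append(CQE(wqe, "success"))


cq = CompletionQueue()
wq = WorkQueue(cq)
for wr_id in (1, 2, 3):
    wq.post(wr_id, "SEND")
while wq.wqes:
    wq.execute_next()

completed = []
while (cqe := cq.poll()) is not None:
    completed.append(cqe.wqe.wr_id)
print(completed)  # [1, 2, 3]
```

Here each CQE points to its WQE rather than copying its contents, which is one of the two options the text allows.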



1. WQEs and CQEs are not architected entities, only the Work Request verbs
are architected.

Figure 11 Consumer Queuing Model

Each consumer may have its own set of work queues; each pair of work
queues is independent from the others. Each consumer creates one or
more completion queues and associates each send and receive queue to
a particular completion queue. It is not necessary that both the send and
receive queue of a work queue pair use the same completion queue.

Because some work queues require an acknowledgment from the remote
node and some WQEs use multiple packets to transfer the data, the
channel adapter can have multiple WQEs in progress at the same time,
even from the same work queue. Thus the order in which CQEs are
posted to the completion queue is not deterministic except that CQEs for
the same work queue are normally posted in the order that the
corresponding WQE was posted to the work queue¹.

1. Receive completions for reliable datagram service are the exception because
concurrent reception on multiple EE contexts can result in out of order posting.

Figure 12 Work Queue Operations

There are three classes of send queue operations: SEND, Remote
Memory Access (RDMA), and MEMORY BINDING.

• For a SEND operation, the WQE specifies a block of data in the
  consumer's memory space for the hardware to send to the
  destination, letting a receive WQE already queued at the destination
  specify where to place that data.

• For an RDMA operation, the WQE also specifies the address in
  the remote consumer's memory. Thus an RDMA operation does
  not need to involve the receive work queue of the destination¹.
  There are 3 types of RDMA operations: RDMA-WRITE, RDMA-READ,
  and ATOMIC.

    • The RDMA-WRITE operation stipulates that the hardware is
      to transfer data from the consumer's memory to the remote
      consumer's memory.

    • The RDMA-READ operation stipulates that the hardware is to
      transfer data from the remote memory to the consumer's
      memory.

    • The ATOMIC operation stipulates that the hardware is to
      perform a read of a remote 64-bit memory location. The target
      returns the value read, and conditionally modifies/replaces the
      remote memory contents by writing an updated value back to
      the same location.

• MEMORY BINDING instructs the hardware to alter memory
  registration relationships (see section 10.6.6.2). It associates (binds) a
  Memory Window to a specified range within an existing Memory
  Region. Memory binding allows a consumer to specify which
  portions of registered memory it shares with other nodes (i.e., the
  memory a remote node can access) and specifies read and write
  permissions. The result produces a memory key (R_KEY) that
  the consumer passes to remote nodes for their use in their RDMA
  operations.
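
As a rough illustration of how the RDMA operations and memory binding listed above fit together, the following Python sketch models a destination's memory, a bound window, and the key check made on each remote access. Everything in it is an assumption made for the example: the class and method names are hypothetical, keys are small integers, the atomic is shown as a compare-and-swap, it is gated by the write permission, and values are stored big-endian. None of these details is taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class MemoryWindow:
    """Hypothetical record created by a MEMORY BINDING operation."""
    start: int
    length: int
    remote_read: bool
    remote_write: bool


class RemoteMemory:
    """Toy model of a destination's registered memory and bound windows."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.windows = {}   # r_key -> MemoryWindow
        self.next_key = 1

    def bind(self, start, length, remote_read, remote_write):
        """Bind a window to a range of the region and return its key."""
        r_key = self.next_key
        self.next_key += 1
        self.windows[r_key] = MemoryWindow(start, length, remote_read, remote_write)
        return r_key

    def _check(self, r_key, addr, length, write):
        win = self.windows.get(r_key)
        if win is None:
            raise PermissionError("unknown key")
        if addr < win.start or addr + length > win.start + win.length:
            raise PermissionError("access outside the bound window")
        if not (win.remote_write if write else win.remote_read):
            raise PermissionError("operation not permitted by the window")

    def rdma_write(self, r_key, addr, payload):
        self._check(r_key, addr, len(payload), write=True)
        self.data[addr:addr + len(payload)] = payload

    def rdma_read(self, r_key, addr, length):
        self._check(r_key, addr, length, write=False)
        return bytes(self.data[addr:addr + length])

    def atomic_compare_swap(self, r_key, addr, compare, swap):
        """Read a 64-bit value, replace it only if it equals `compare`,
        and return the value that was read."""
        self._check(r_key, addr, 8, write=True)
        old = int.from_bytes(self.data[addr:addr + 8], "big")
        if old == compare:
            self.data[addr:addr + 8] = swap.to_bytes(8, "big")
        return old


mem = RemoteMemory(64)
key = mem.bind(start=16, length=16, remote_read=True, remote_write=True)
mem.rdma_write(key, 16, b"hello")
data = mem.rdma_read(key, 16, 5)
first = mem.atomic_compare_swap(key, 24, compare=0, swap=7)   # swap happens
second = mem.atomic_compare_swap(key, 24, compare=0, swap=9)  # no swap
```

Note that no receive queue appears anywhere in the sketch: the initiator names the remote address and presents the key, which is the point the RDMA bullet makes.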

There is only one receive queue operation and it is to specify a receive
data buffer.

• A RECEIVE WQE specifies where the hardware is to place data
  received from another consumer when that consumer executes a
  SEND operation. Each time the remote consumer successfully
  executes a SEND operation, the hardware takes the next entry
  from the receive queue, places the received data in the memory
  location specified in that receive WQE, and places a CQE on the
  completion queue indicating to the consumer that the receive
  operation has completed. Thus the execution of a SEND operation
  causes a receive queue operation at the remote consumer.
Normally an RDMA operation does not consume a receive WQE at the
destination, but there is one exception. That is for an RDMA WRITE
operation which specifies immediate data. Immediate data is 32 bits of
information that is optionally provided in a SEND or RDMA WRITE instruction,
transferred as part of the operation, but instead of writing the immediate
data to memory, the data is treated as another piece of status information
and returned as a special field of the RECEIVE CQE status. This means
that an RDMA WRITE with immediate data will consume a RECEIVE
WQE at the destination.

1. RDMA Write with immediate data does involve the destination's receive work
queue.

3.2.2 Connections

IBA supports both connection oriented and datagram service. For
connected service, each QP is associated with exactly one remote consumer.
In this case the QP context is configured with the identity of the remote
consumer's queue pair. The remote consumer is identified by a port and
a QP number. The port is identified by a local ID (LID) and optionally a
Global ID (GID). During the communication establishment process, this
and other information is exchanged between the two nodes.

For datagram service, a QP is not tied to a single remote consumer, but
rather information in the WQE identifies the destination. A communication
setup process similar to the connection setup process needs to occur with
each destination to exchange that information.

3.3 Communications Stack

The communication stack for IBA is illustrated in Figure 13. The architec- 
ture provides a number of IBA transactions that a consumer can use to 
execute a transaction with a remote consumer. The consumer posts work 
queue elements (WQE) to the QP and the channel adapter interprets 
each WQE to perform the operation.

Figure 13 IBA Communication Stack

For Send Queue operations, the channel adapter interprets the WQE,
creates a request message, segments the message into multiple packets if
necessary, adds the appropriate routing headers, and sends the packet
out the appropriate port.

The port logic transmits the packet over the link where switches and
routers relay the packet through the fabric to the destination.

When the destination receives a packet, the port logic validates the
integrity of the packet. The channel adapter associates the received packet
with a particular QP and uses the context of that QP to process the packet
and execute the operation. If necessary, the channel adapter creates a
response (acknowledgment) message and sends that message back to the
originator.

Reception of certain request messages causes the channel adapter to
consume a WQE from the receive queue. When it does, a CQE
corresponding to the consumed WQE is placed on the appropriate completion
queue, which causes a work completion to be issued to the consumer that
owns the QP.

3.4 IBA Components

The devices in an IBA system are classified as:

• switches
• routers
• channel adapters
• repeaters
• links that interconnect switches, routers, repeaters, and channel
  adapters

The management infrastructure includes:

• subnet managers
• general service agents

3.4.1 Links & Repeaters

Links interconnect channel adapters, switches, repeaters, and routing
devices to form a fabric. A link can be a copper cable, an optical cable, or
printed circuit wiring on a backplane. Repeaters are transparent¹ devices
that extend the range of a link. Volume 2 of InfiniBand Architecture
specifies link and repeater requirements for various media types as well as
defines various module form factors for I/O devices. The architecture
described in Volume 1 is independent of the type of link and the form
factor.

1. Transparent in the sense repeaters only participate at the physical layer
protocol level and nodes are not aware of their presence.

Links and repeaters are not directly addressable but the link status can be
determined via the device on each end of the link.

3.4.2 Channel Adapters



Channel adapters are the IBA devices in processor nodes and I/O units 
that generate and consume packets. IBA defines two types of channel 
adapters: Host Channel Adapter (HCA) and Target Channel Adapter 
(TCA). The HCA provides a consumer interface providing the functions 
specified by IBA verbs. IBA does not specify the semantics of the con- 
sumer interface for a TCA. 

A channel adapter is a programmable DMA engine with special protection 
features that allow DMA operations to be initiated locally and remotely.

Figure 14 Channel Adapter

A channel adapter may have multiple ports. Each port of a channel
adapter is assigned a Local ID (LID) or a range of LIDs. Each port has its
own set of transmit and receive buffers such that each port is capable of
sending and receiving concurrently. Buffering is channeled through virtual
lanes (VL) where each VL has its own flow control.

The channel adapter provides a Memory Translation & Protection (MTP)
mechanism that translates virtual addresses to physical addresses and
validates access rights. Specific memory management mechanisms are
not specified by this document, and requirements for such mechanisms
are not specified for TCAs.

The channel adapter provides multiple instances of the communication
interface to its consumer in the form of queue pairs (QP) comprised of a
send and receive work queue.

A subnet manager configures channel adapters with the local addresses
for each physical port, i.e., the port's LID. The entity that communicates
with the subnet manager for the purpose of configuring the channel
adapter is referred to as the Subnet Management Agent (SMA).

Each channel adapter has a globally unique identifier (GUID) assigned by
the channel adapter vendor. Since local IDs assigned by the subnet
manager are not persistent (i.e., might change from one power cycle to the
next), the channel adapter GUID (called Node GUID) becomes the
primary object to use for persistent identification of a channel adapter.
Additionally, each port has a Port GUID assigned by the channel adapter
vendor.

3.4.3 Switches

In contrast to channel adapters, switches do not generate nor consume 
packets (except for management packets). They simply pass them along 
based on the destination address in the packet's local route header. 

IBA switches are the fundamental routing component for intra-subnet 
routing (inter-subnet routing is provided by IBA routers). Switches inter- 
connect links by relaying packets between the links. 
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Figure 15 IBA Switch Elements 

Switches expose two or more ports between which packets are relayed. 

Switches are transparent to the endnodes, which means they are not
directly addressed (except for management operations). Instead, packets
traverse the switch fabric virtually unchanged by the fabric. To this end,
every destination within the subnet is configured with one or more unique
local identifiers (LID). From the point of view of a switch, a LID represents
a path through the switch. Switch elements are configured with forwarding
tables. Packets contain a destination address that specifies the LID of the
destination. Individual packets are forwarded within a switch to an
outbound port or ports based on the packet's Destination LID and the
switch's forwarding table.

IBA switches support unicast forwarding and may support multicast
forwarding. Unicast is the delivery of a single packet to a single destination
and multicast is the ability of the fabric to deliver a single packet to multiple
destinations.

A subnet manager configures switches, including loading their forwarding
tables.

To maximize availability, multiple paths between endnodes may be
deployed within the switch fabric. If multiple paths are available between
switches, the subnet manager can use these paths for redundancy or for
destination LID based load sharing. Where multiple paths exist, a subnet
manager can re-route packets around failed links by re-loading the
forwarding tables of switches in the affected area of the fabric.

3.4.4 Routers

Like switches, routers neither generate nor consume packets (except for
management packets). They simply pass them along. Routers forward
packets based on the packet's global route header and actually replace
the packet's local route header as the packet passes from subnet to
subnet.

IBA routers are the fundamental routing component for inter-subnet 
routing (intra-subnet routing is provided by IBA switches). Routers inter- 
connect subnets by relaying packets between the subnets. 
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Figure 16 IBA Router Elements 

Routers expose one or more ports between which packets are relayed.
Routers could be embedded with other devices, such as channel adapters
or switches.

Routers are not completely transparent to the endnodes since the source
must specify the LID of the router and also provide the GID of the
destination.

Each subnet is uniquely identified with a subnet ID known as the Subnet
Prefix. The subnet manager programs all ports (via the PortInfo attribute)
with the Subnet Prefix for that subnet. When combined with a Port GUID,
this combination becomes a port's natural GID. Ports may have other
locally administered GIDs.

From the point of view of a router, the subnet prefix portion of a GID
represents a path through the router. IPv6 specifies the protocol performed
between routers to derive their forwarding tables. Individual packets are
forwarded within a router to an outbound port or ports based on the
packet's Destination GID and the router's forwarding table.

Each router forwards the packet through the next subnet to another router
until the packet reaches the target subnet. The last router sends the
packet using the LID associated with the Destination GID as the
Destination LID.

A subnet manager configures routers with information about the subnet
such as which VLs to use and partition information.

To maximize availability, multiple paths between subnets may be
deployed within the fabric. If multiple paths are available, routers might use
those paths for redundancy or for load sharing. Where multiple paths
exist, a router can re-route packets around failed subnets.

3.4.5 Management Components

IBA management provides for a subnet manager and an infrastructure
that supports a number of general management services. The management
infrastructure requires a subnet management agent in each node
and defines a general service interface that allows additional general
services agents.

The architecture defines a common management datagram (MAD)
message structure for communicating between managers and management
agents.

3.4.5.1 Subnet Managers

A Subnet Manager (SM) is an entity attached to a subnet that is
responsible for configuring and managing switches, routers, and channel
adapters. A SM can be implemented with other devices, such as a
channel adapter or a switch.

IBA supports the notion of multiple subnet managers per subnet and
specifies how multiple subnet managers negotiate for one to become the
master SM. It does not prohibit other methods between cooperating SMs
for governing master/standby relationships.

The master SM:

• discovers the subnet topology,

• configures each channel adapter port with a range of LIDs, GIDs,
subnet prefix, and P_Keys,

• configures each switch with a LID, the subnet prefix, and with its
forwarding database,

• maintains the endnode and service databases for the subnet and
thus provides a GUID to LID/GID resolution service as well as a
services directory.



3.4.5.2 Subnet Management Agents

Each node provides a Subnet Management Agent (SMA) that the SM
accesses through a well known interface called the Subnet Management
Interface (SMI). SMI allows for both LID Routed packets and Directed Routed
packets. Directed routing provides the means to communicate before
switches and end nodes are configured. Only the SMI allows for directed
routed packets.

3.4.5.3 General Service Agents

Each node may contain additional management agents referred to as
General Service Agents (GSA) that can be accessed through a well
known interface called the General Service Interface (GSI). The GSI only
supports LID routing. The general service classes defined by IBA are:

• Subnet Administration (SubnAdm) - this is a service provided by
the SM that allows nodes to access information about the subnet
to discover other nodes and services, to resolve paths, and to
register their services.

• Performance Management (Perf) - monitors and reports
well-defined performance counters.

• Device Management (DevMgt) - provides for management of I/O
devices behind TCAs.

• Baseboard Management (BM) - provides for chassis management
using IB-ML as defined in Volume 2.

• SNMP Tunneling (SNMP) - provides SNMP functionality by defining
the method for sending and receiving SNMP messages.

• Vendor Defined (Vendor) - allows private extensions that a device
vendor may use to remotely configure and manage its devices.

• Communication Management (ConMgt) - provides for connection
establishment and other communication management functions
between endnodes.

• Device Configuration (DevConfMgt) - provides I/O resource
management.

3.5 IBA Features

3.5.1 Queue Pairs

The QP is the virtual interface that the hardware provides to an IBA
consumer and it provides a virtual communication port for the consumer. The
architecture supports up to 2^24 QPs per channel adapter and the
operation on each QP is independent from the others. Each QP provides a high
degree of isolation and protection from other QP operations and other
consumers. Thus a QP can be considered a private resource assigned to
a single consumer. A consumer might consume multiple QPs as
illustrated in Figure 17.
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Figure 17 Communication Interface 



The consumer creates this virtual communication port by allocating a QP 
and specifying its class of service. Communication takes place between a 
source QP and a destination QP. For connection oriented service, each 
QP is tightly bound to exactly one other QP, usually on a different node. 
The consumer initiates any communication establishment necessary to 
bind the QP with the destination QP and configures the QP context with 
certain information such as Destination LID, service level, and negotiated 
operating limits. 

The consumer posts work requests to a QP to invoke communication
through that QP.

3.5.2 Types of Service

Each QP is configured for a certain class of operation (referred to as
service type) based on how the sourcing and receiving QPs interact. Both the
source and destination QPs must be configured for the same service type.
Each service type is based on the following attributes.

• Connection oriented versus datagram - For connection oriented
service, the QP is associated with exactly one other QP and
all work requests posted to the QP result in a message sent to
the established destination QP. Datagram operation allows a single
QP to be used to send and receive messages to/from any
appropriate QP on any node.

• Acknowledged versus unacknowledged - For acknowledged
service, a QP returns response messages when it receives request
messages. Response messages might be positive acknowledgment
(ACK), negative acknowledgment (NAK), or
contain response data. Acknowledged service is referred to as
reliable since the transport protocol guarantees un-corrupted data
delivery, in order, exactly once. Unacknowledged service is
referred to as unreliable because the transport protocol does not
guarantee that all data is delivered. It does guarantee that all data
is delivered at most once, and delivered data is not corrupted.
Also there are certain cases where changes in fabric configuration
might cause data to be delivered out of order.

• IBA transport versus other transport - IBA transport services
define a specific transport protocol for channel based and memory
based operations. IBA also supports using the channel adapter
as a data link engine to send raw packets between nodes, which
is useful for supporting legacy protocol stacks and legacy networks.

The service types defined by IBA are specified in Table 2.

Table 2 Service Types

Service Type            Connection Oriented   Acknowledged   Transport
Reliable Connection     yes                   yes            IBA
Unreliable Connection   yes                   no             IBA
Reliable Datagram       no                    yes            IBA
Unreliable Datagram     no                    no             IBA
RAW Datagram            no                    no             Raw




Certain IBA operations are valid only over certain classes of service. A QP
rejects a WQE for an operation that is not valid for the configured class of
service.

Connection oriented service requires that the consumer initiate a
communication establishment procedure (connection setup) with the target node
to associate the QPs and establish QP context prior to any QP operation.
Actually, all service classes except for raw datagram need some form of
communication setup to associate queue pairs. For reliable datagram
service, the node performs a communication establishment process to
associate an end-to-end (EE) context (explained later) with each target node.
All QPs configured for Reliable Datagram service use established EE
contexts and the work request specifies which EE context to use for that
operation.

Raw Datagrams are similar to unreliable datagrams, except that the
source QP does not know the identity of the QP that will receive and
process the message. Raw datagrams allow for routers that forward raw
datagram packets to non IBA destinations on a disparate fabric (such as a
LAN or WAN) that has no equivalent of a QP. There are two types of raw
datagrams, IPv6 and Ethertype. IPv6 raw datagrams contain a global
routing header and the packet payload contains a transport protocol
service data unit as identified in the global routing header. An Ethertype raw
datagram contains an Ethernet Type field and the packet payload contains
a transport protocol service data unit as identified in the Ethernet Type
field.

IBA defines both channel (send/receive) and memory (RDMA) semantics.
Raw datagram and Unreliable Datagram services do not support memory
semantics.

3.5.3 Keys

IBA uses various keys to provide isolation and protection. Keys are values
assigned by an administrative entity that are used in messages in various
ways. The keys themselves do not provide security since the keys are
available in messages that cross the fabric and thus any entity that can
get to the interior of the fabric can ascertain key values. IBA does place
restrictions on how applications can access certain keys.

The keys are:

• Management Key (M_Key): Enforces the control of a master subnet
manager. Administered by the subnet manager and used in certain
subnet management packets. Each channel adapter port has an
M_Key that the SM sets and then enables. The SM may assign a
different key to each port. Once enabled, the port rejects certain
management packets that do not contain the programmed M_Key. Thus
only a SM with the programmed M_Key can alter a node's fabric
configuration. The SM can prevent the port's M_Key from being read as
long as the SM is active. The port maintains a time-out such that the
port reverts to an unmanaged state if the SM fails. There is one
M_Key for a switch.

• Baseboard Management Key (B_Key): Enforces the control of a
subnet baseboard manager. Administered by the subnet baseboard
manager and used in certain MADs. Each channel adapter port has a
B_Key that the baseboard manager sets. The baseboard manager
may assign a different key to each port. Once enabled, the port
rejects certain management packets that do not contain the
programmed B_Key. Thus only a baseboard manager with the
programmed B_Key can alter a node's baseboard configuration. The
baseboard manager can prevent the port's B_Key from being read as
long as the baseboard manager is active. The port maintains a
time-out such that the port reverts to an unmanaged state if the baseboard
manager fails. There is one B_Key for a switch.

• Partition Key (P_Key): Enforces membership. Administered through
the subnet manager by the partition manager (PM). Each channel
adapter port contains a table of partition keys which is set up by the
PM. QPs are required to be configured for the same partition to
communicate (except QP0, QP1, and ports configured for raw
datagrams) and thus the P_Key is carried in every IB transport packet.
Part of the communication establishment process determines which
P_Key a particular QP or EEC uses. An EEC contains the P_Key
for Reliable Datagram service and a QP context contains the P_Key
for the other IBA transport types. The P_Key in the QP or EEC is
placed in each packet sent, and compared with the P_Key in each
packet received. Received packets whose P_Key comparison fails
are rejected. Each switch has one P_Key table for management
messages and may optionally support partition enforcement tables that
filter packets based on their P_Key.

• Queue Key (Q_Key): Enforces access rights for reliable and
unreliable datagram service (RAW datagram service type not included).
Administered by the channel adapter. During communication
establishment for datagram service, nodes exchange Q_Keys for particular
queue pairs and a node uses the value it was passed for a remote
QP in all packets it sends to that remote QP. Likewise, the remote
node uses the Q_Key it was provided. Receipt of a packet with a
different Q_Key than the one the node provided to the remote queue
pair means that packet is not valid and thus rejected.

Q_Keys with the most significant bit set are considered controlled
Q_Keys (such as the GSI Q_Key) and a HCA does not allow a
consumer to arbitrarily specify a controlled Q_Key. An attempt to send a
controlled Q_Key results in using the Q_Key in the QP context. Thus
the OS maintains control since it can configure the QP context for the
controlled Q_Key for privileged consumers.

InfiniBand™ Trade Association    Page 72
Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067

InfiniBand™ Architecture Release 1.0    Architectural Overview    October 24, 2000
Volume 1 - General Specifications    FINAL

• Memory Keys (L_Key and R_Key): Enable the use of virtual
addresses and provide the consumer with a mechanism to control
access to its memory. These keys are administered by the channel
adapter through a registration process. The consumer registers a
region of memory with the channel adapter and receives an L_Key and
R_Key. The consumer uses the L_Key in work requests to describe
local memory to the QP and passes the R_Key to a remote consumer
for use in RDMA operations. When a consumer queues up a RDMA
operation it specifies the R_Key passed to it from the remote
consumer and the R_Key is included in the RDMA request packet to the
original channel adapter. The R_Key validates the sender's right to
access the destination's memory and provides the destination
channel adapter with the means to translate the virtual address to a
physical address.
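
The P_Key and Q_Key rules in the list above can be illustrated with a short Python sketch. It is not part of the specification. The 32-bit Q_Key width, the class name DatagramQP, and the function names are assumptions made for illustration, and holding the P_Key in the same object as the Q_Key is a simplification (for Reliable Datagram service the P_Key lives in the EEC).

```python
from dataclasses import dataclass

# Assumption: the Q_Key is a 32-bit value, so its most significant bit is bit 31.
CONTROLLED_Q_KEY_BIT = 1 << 31


def is_controlled_q_key(q_key: int) -> bool:
    """Return True if the most significant bit of the Q_Key is set."""
    return bool(q_key & CONTROLLED_Q_KEY_BIT)


def q_key_to_send(requested_q_key: int, qp_context_q_key: int) -> int:
    """Pick the Q_Key placed in an outgoing datagram.

    A consumer may not arbitrarily specify a controlled Q_Key; an attempt
    to do so results in the Q_Key from the QP context being used instead.
    """
    if is_controlled_q_key(requested_q_key):
        return qp_context_q_key
    return requested_q_key


@dataclass
class DatagramQP:
    """Hypothetical receive-side view of a datagram QP."""
    p_key: int  # partition key this QP is configured with
    q_key: int  # Q_Key this node handed to its remote peers

    def accepts(self, packet_p_key: int, packet_q_key: int) -> bool:
        """A received packet is rejected if either key comparison fails."""
        return packet_p_key == self.p_key and packet_q_key == self.q_key
```

In this sketch only the OS, which writes the QP context, can cause a controlled Q_Key to appear in a packet, which mirrors the reasoning given for privileged consumers.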

3.5.4 Virtual Memory Addresses

IBA is optimized for virtual addressing. That is, an IBA consumer uses
virtual addresses in work requests and the channel adapter is able to
convert the virtual address to a physical address as necessary. For this to
happen, each consumer registers regions of virtual memory with the
channel adapter and the channel adapter returns two memory handles
called L_Key and R_Key to the consumer. The consumer then uses the
L_Key in each work request that requires a memory access to that region.
See 3.5.3 for a description of L_Key usage.

Memory Registration provides mechanisms that allow IBA consumers to
describe a set of virtually contiguous memory locations or a set of
physically contiguous memory locations to allow the HCA to access the
memory as a virtually contiguous buffer using virtual addresses.

IBA also supports remote memory access (RDMA) that permits a remote
consumer to access that registered memory. For RDMA, the consumer
passes the R_Key and a virtual address of a buffer in that memory region
to another consumer. That remote consumer supplies that R_Key in its
RDMA WQEs that will access memory in the original node. See 3.5.3 for a
detailed description of R_Key usage.
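
As a rough illustration of registration and R_Key checking, the following Python sketch models a channel adapter's registration table. It is a toy under stated assumptions, not the mechanism the architecture mandates: the names MemoryRegion, ChannelAdapterMemory, register, and translate_remote are hypothetical, key values are simple counters, and each region is assumed to be physically contiguous so that translation is a single offset.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Tuple


@dataclass(frozen=True)
class MemoryRegion:
    """A registered, virtually contiguous buffer (hypothetical model)."""
    virt_base: int  # first virtual address of the region
    length: int     # size of the region in bytes
    phys_base: int  # physical address backing virt_base (simplified)


class ChannelAdapterMemory:
    """Toy registration table; key values here are arbitrary counters."""

    def __init__(self) -> None:
        self._next_key = count(1)
        self._by_l_key: Dict[int, MemoryRegion] = {}
        self._by_r_key: Dict[int, MemoryRegion] = {}

    def register(self, region: MemoryRegion) -> Tuple[int, int]:
        """Register a region and return its (L_Key, R_Key) handles."""
        l_key, r_key = next(self._next_key), next(self._next_key)
        self._by_l_key[l_key] = region
        self._by_r_key[r_key] = region
        return l_key, r_key

    def translate_remote(self, r_key: int, virt_addr: int, nbytes: int) -> int:
        """Validate an RDMA access and translate it to a physical address."""
        region = self._by_r_key.get(r_key)
        if region is None:
            raise PermissionError("unknown R_Key")
        offset = virt_addr - region.virt_base
        if offset < 0 or offset + nbytes > region.length:
            raise PermissionError("access outside the registered region")
        return region.phys_base + offset
```

A consumer would pass the returned R_Key and a virtual address inside the region to its remote peer; the peer's RDMA request is then checked by translate_remote on arrival.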

3.5.5 Protection Domains

Not only does memory registration allow the use of virtual memory
addressing, but it also provides an increased level of protection against
inadvertent and unauthorized access.

Since a consumer might communicate with many different destinations
but not wish to let all those destinations have the same access to its
registered memory, IBA provides protection domains. Protection domains
allow a consumer to control which set of its Memory Regions and Memory
Windows can be accessed by which set of its QPs.

Before a consumer allocates a QP or registers memory, it creates one or
more protection domains. QPs are allocated to, and memory registered
to, a protection domain. L_Keys and R_Keys for a particular memory
domain are only valid on QPs created for the same protection domain.

3.5.6 Partitions

Partitioning enforces isolation among systems sharing an InfiniBand
fabric. Partitioning is not related to boundaries established by subnets,
switches, or routers. Rather a partition describes a set of endnodes within
the fabric that may communicate.

Each port of an endnode is a member of at least one partition and may be
a member of multiple partitions. A partition manager assigns partition keys
(P_Keys) to each channel adapter port. Each P_Key represents a
partition. Each QP¹ and EE context is assigned to a partition and uses that
P_Key in all packets it sends and inspects the P_Key in all packets it
receives. Reception of an invalid P_Key causes the packet to be discarded.

Switches and routers may optionally be used to enforce partitioning. In
this case the partition manager programs the switch or router with P_Key
information and when the switch or router detects a packet with an invalid
P_Key, it discards the packet.

3.5.7 Virtual Lanes

Virtual lanes (VL) provide a mechanism for creating multiple virtual links
within a single physical link. A virtual lane represents a set of transmit and
receive buffers in a port. All ports support VL15 which is reserved
exclusively for subnet management. There are 15 other VLs (VL0 to VL14)
called data VLs and all ports support at least one data VL (VL0) and may
provide VL1 to VLn-1 (where n is the number of data VLs the port supports).

The actual data VL that a port uses is configured by the SM and is based
on the Service Level (SL) field in the packet. The default is to use VL0 until
the SM determines the number of VLs that are supported by both ends of
the link and programs the port's SL to VL mapping table.

¹ Except QP0, QP1, and QPs configured for the Raw Datagram type of
service.

The port maintains separate flow control over each data VL such that
excessive traffic on one VL does not block traffic on another VL.

[Diagram labels: Management VL, Data VLs]

Figure 18 Virtual Lanes

VL assignment exists only between ports at each end of a link and VL
assignment on one link is independent of assignments on other links.

Each packet has a SL which is specified in the packet header. As a packet
traverses the fabric, its SL determines which VL will be used on each link.
Each port maintains a table of SL to VL mapping such that a packet is sent
on the appropriate VL.

When the ports at each end of a link support a different number of data
VLs, the port with the higher number degrades to the number supported
by the other port. Thus for ports that only support a single data VL, all data
traffic defaults to VL0.

3.5.8 Quality of Service

IBA provides several mechanisms that permit a subnet manager to
administer various quality of service guarantees for both connected and
connectionless services. These mechanisms are Service Level, Service
Level to Virtual Lane Mapping, and Partitions. IBA does not define quality
of service (QoS) levels (e.g., best effort).

3.5.8.1 Service Level

IBA defines a Service Level (SL) attribute that permits a packet to operate
at one of 16 service levels. The definition and purpose of each service
level is outside the scope of the architecture and left as a fabric
administration policy. Thus the assignment of service levels is a function of each
node's communication manager and its negotiation with a subnet
manager.

3.5.8.2 SL to VL Mapping

Another IBA mechanism that is tied to service levels is virtual lanes.
Each packet identifies its SL and as the packet traverses the fabric, the
packet's SL determines which VL is used on the next link. To this end,
each port (switches, routers, endnodes) has a SL to VL mapping table that
is configured by subnet management. Naturally, for all links that terminate
at a port that only supports one data VL, all SLs map to VL0. Otherwise,
subnet management policy determines the mapping of each SL to an
available VL.

Packets addressed to QP0 are Subnet Management Packets (SMP) and
exclusively use VL15 and their SL is ignored. VL15 (the management VL)
is not a data VL and is not used for packets not addressed to QP0.
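
The per-link VL selection described above can be sketched in Python as follows. This is an illustrative model only: the function names are hypothetical, and the round-robin spreading of SLs over the usable data VLs is a placeholder, since the actual mapping is a subnet management policy that the architecture does not define.

```python
MANAGEMENT_VL = 15       # VL15 is reserved for subnet management
NUM_SERVICE_LEVELS = 16  # a packet operates at one of 16 service levels


def build_sl_to_vl_table(local_data_vls: int, peer_data_vls: int) -> list:
    """Build a 16-entry SL to VL table for one link.

    The usable data VL count is the smaller of what the two ends support.
    Spreading SLs round-robin over the usable VLs is only a placeholder
    policy; the real mapping is chosen by subnet management.
    """
    usable = min(local_data_vls, peer_data_vls)
    if not 1 <= usable <= 15:
        raise ValueError("a port supports between 1 and 15 data VLs")
    return [sl % usable for sl in range(NUM_SERVICE_LEVELS)]


def select_vl(sl_to_vl: list, sl: int, dest_qp: int) -> int:
    """Choose the VL for a packet on the next link."""
    if dest_qp == 0:
        return MANAGEMENT_VL  # SMPs use VL15 and their SL is ignored
    return sl_to_vl[sl]
```

When either end of the link supports a single data VL, every entry of the table is 0, which matches the statement that all SLs then map to VL0.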



3.5.8.3 Partitions



Another IBA mechanism that can be tied to service levels is partitioning. 
Fabric administration can assign certain SLs for particular partitions. This 
allows the SM to isolate traffic flows between those partitions and even if 
both partitions operate at the same QoS level, each partition can be guar- 
anteed its fair share of bandwidth regardless of whether nodes in other 
partitions misbehave or are oversubscribed.

3.5.9 Injection Rate Control 

IBA defines a number of different link bit rates. The lowest bit rate of 2.5
Gb/sec is referred to as a 1x (times one) link. Other link rates are
10 Gb/sec (4x) and 30 Gb/sec (12x). To support multiple link speeds within
a fabric, IBA defines a Static Rate Control mechanism that prevents a port
with a high speed link from overrunning the capacity of a port with a lower
speed link.
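
To make the rate mismatch concrete, the short Python sketch below computes how much faster one link can inject traffic than another can drain it, using only the three bit rates named above. The table and function names are illustrative and not taken from the specification.

```python
from fractions import Fraction

# Link bit rates named in the text, in Gb/sec.
LINK_RATE_GBPS = {"1x": Fraction(5, 2), "4x": Fraction(10), "12x": Fraction(30)}


def injection_ratio(source_width: str, dest_width: str) -> Fraction:
    """How many times faster the source link is than the destination link."""
    return LINK_RATE_GBPS[source_width] / LINK_RATE_GBPS[dest_width]
```

A ratio greater than 1 is the case that static rate control is meant to handle; a ratio of 1 means the two link rates already match.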

As part of the path resolution process, the SubnAdm:PathRecord pro- 
vides the node with MTU and rate information for the path. Path informa- 
tion is used since either a switch port or the endnode could be the limiting 
factor. 

The example in Figure 19 illustrates that port A with a 12x link speed has
the potential for injecting traffic at 3 times the capacity of port B and 12
times the capacity of ports C, D, or E. Additionally port B has the potential
for injecting traffic at 4 times the capacity of port C, D, or E. Since traffic
tends to be bursty, every time port A sends to one of the other ports, the
fabric has a high probability of congesting. Link flow control keeps the
fabric from losing packets due to that congestion, but the back pressure
will affect other paths that otherwise would not be congested.

IBA solves this problem by defining a static rate control mechanism for
ports that operate at link speeds greater than 1x.

Figure 19 Rate Matching Example

Each destination has a time-out value associated with it and that time-out
value is based on the ratio between the source and destination bit rates.
When the source and destination bit rates are equal, the time-out value
is 0 (not needed). Otherwise when the port transmits a packet to a
destination, it puts that destination LID and a time-out value in its static rate
control table. The port removes the entry after the time-out period expires.
While the entry remains in the table, the port does not send any more
packets to that destination (i.e., defers to traffic for other destinations not
in the table). When there is no entry in the table, the port transmits the
packet by placing it on the appropriate VL output queue.

3.5.10 Addressing

Each endnode contains one or more channel adapters and each channel 
adapter contains one or more ports. Additionally each channel adapter 
contains a number of queue pairs (QP). 

Each QP has a queue pair number (QPN) assigned by the channel
adapter which uniquely identifies the QP within the channel adapter.
There are two well-known QPs for each port (QP0 and QP1) and all other
QPs are configured for operation through a particular port. For reliable
datagram service, it is the EE context rather than the QP context that
determines the port.

Packets other than raw datagrams contain the QPN of the destination QP.
When the channel adapter receives a packet, it uses the context of the
destination QPN (and EE context for reliable datagram) to process the
packet.

Each port has a Local ID (LID) assigned by the local subnet manager (i.e.,
the subnet manager for the subnet). Within the subnet, LIDs are unique.
Switches use the LID to route packets within the subnet. The local subnet
manager configures routing tables in switches based on LIDs and where
that port is located with respect to the specific switch. Each packet
contains a Source LID (SLID) that identifies the port that injected the packet
into the subnet and a Destination LID (DLID) that identifies the port where
the fabric is to deliver the packet.

IBA also provides for multiple virtual ports within a physical port by
defining a LID Mask Control (LMC). The LMC specifies the number of least
significant bits of the LID that a physical port masks (ignores) when
validating that a packet DLID matches its assigned LID. Those bits are not
ignored by switches, therefore the subnet manager can program different
paths through the fabric based on those least significant bits. Thus the
port appears to be 2^LMC ports for the purpose of routing across the fabric.
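
The difference between how an endport and a switch treat the low LID bits can be sketched in Python as follows. The function names and the dictionary used as a forwarding table are hypothetical, and the sketch assumes the port's base LID has its LMC low bits set to zero.

```python
def port_accepts_dlid(base_lid: int, lmc: int, dlid: int) -> bool:
    """Endport check: ignore the LMC least significant bits of the DLID."""
    mask = ~((1 << lmc) - 1)
    return (dlid & mask) == (base_lid & mask)


def lids_for_port(base_lid: int, lmc: int) -> range:
    """All 2**LMC LIDs that address the same physical port.

    Assumes base_lid has its LMC low bits cleared.
    """
    return range(base_lid, base_lid + (1 << lmc))


def forward(forwarding_table: dict, dlid: int) -> int:
    """Switch lookup: the full DLID (no masking) selects the output port."""
    return forwarding_table[dlid]
```

For example, with a base LID of 0x10 and an LMC of 2, the port accepts DLIDs 0x10 through 0x13, while a switch may map each of those four DLIDs to a different output port and so to a different path.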

Each port also has at least one Global ID (GID) that is in the format of an
IPv6 address. GIDs are globally unique. Each packet optionally contains
a Global Route Header (GRH) specifying a Source GID (SGID) that
identifies the port that injected the packet into the fabric and a Destination GID
(DGID) that identifies the port where the fabric is to deliver the packet.
Routers use the GRH to route packets between subnets. Switches ignore
the GRH.

Each channel adapter has a Globally Unique Identifier (GUID) called the
Node GUID assigned by the channel adapter vendor. Each of its ports has
a Port GUID also assigned by the channel adapter vendor. The Port GUID
combined with the local subnet prefix becomes a port's default GID.
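
A minimal Python sketch of forming a default GID is shown below. It assumes, beyond what this overview states, that the subnet prefix fills the upper 64 bits and the Port GUID the lower 64 bits of the 128-bit IPv6-format value. The function name default_gid is hypothetical.

```python
import ipaddress


def default_gid(subnet_prefix: int, port_guid: int) -> ipaddress.IPv6Address:
    """Combine a subnet prefix and a Port GUID into an IPv6-format GID.

    Assumption: the prefix occupies the upper 64 bits and the GUID the
    lower 64 bits of the 128-bit value.
    """
    limit = 1 << 64
    if not (0 <= subnet_prefix < limit) or not (0 <= port_guid < limit):
        raise ValueError("prefix and GUID must each fit in 64 bits")
    return ipaddress.IPv6Address((subnet_prefix << 64) | port_guid)
```

Because the Port GUID is assigned by the vendor and does not change, only the prefix half of such a GID depends on the subnet to which the port is attached.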

Subnet administration provides a GUID to LID/GID resolution service.
Thus a node can persistently identify another node by remembering a
Node or Port GUID.

The address of a QP is the combination of the port address (GID + LID)
and the QPN. To communicate with a QP requires a vector of information
including the port address (LID and/or GID), QPN, service level, path
MTU, and possibly other information. This information can be obtained by
a path query request addressed to Subnet Administration.

Service IDs are used to resolve QPs. Some Service IDs are well known
(i.e., certain functions have a predetermined Service ID) and some are
advertised in an I/O controller's Service Entries list. The subnet manager
provides the GUID to GID/LID resolution, but the target provides a Service
ID to QP resolution as part of the communication management process.

In general, the target node of a Request for Communication message
uses the Service ID to direct the request to the entity that decides whether
to proceed with communication establishment. If the decision is
affirmative, the target returns the information necessary to establish
communication, which includes the QPN plus other information specific to the
transport service type.

A simplified address resolution process is illustrated in Figure 20.

[Diagram labels: Subnet Administration, Initiator, Target Node;
GetPathRecord (SL, etc.); GetUnitInfo (list of protocols and Service IDs);
request carrying the Service ID; Response (QPN)]

Figure 20 Simplified Address Resolution Process

In the illustration, the target is an I/O controller where the initiator learns
the Service ID by querying the IOC for a list of I/O protocols supported.
The second path resolution is only necessary if the service being
established uses different path characteristics (SL, QoS, MTU, etc.) than the
management MADs.

3.5.11 Multicast

Multicast is a one-to-many / many-to-many communication paradigm de- 
signed to simplify and improve the efficiency of communication between 
a set of end nodes. 

Each multicast group is identified by a unique LID and GID. The LID is
only unique within the subnet. A node joins and leaves a multicast group
through a management action where the node supplies the LID for each
port that will participate. This information is distributed to the switches.
Each switch is configured with routing information for the multicast traffic
which specifies all of the ports where the packet needs to travel. Care is
taken to assure there are no loops (i.e., a single spanning tree such that
a packet is not forwarded to a switch that already processed that packet).

The node uses the multicast LID and GID in all packets it sends to that
multicast group. When a switch receives a multicast packet (i.e., a packet
with a multicast LID in the packet's DLID field) it replicates the packet and
sends it out to each of the designated ports except the arrival port. In this
fashion, each cascaded switch replicates the packet such that the packet
arrives only once at every subscribed endnode.

The channel adapter may limit the number of QPs that can register for the
same multicast address. The channel adapter distributes multicast
packets to QPs registered for that multicast address. A single QP can be
registered for multiple addresses for the same port but if a consumer
wishes to receive multicast traffic on multiple ports it needs a different QP
for each port. The channel adapter recognizes a multicast packet by the
packet's DLID or by the special value in the packet's Destination QP field
and routes the packet to the QPs registered for that address and port.
Note that the Destination QP field in a multicast packet is not a QPN.



3.5.11.1 Multicast Example 



Figure 21 illustrates an example unreliable multicast IBA operation:

• A packet with PSN = 1129 is received on an IBA routing element
(switch or router) port.

• The switching / routing element examines the packet header and
extracts the DLID / multicast GID to determine if it corresponds to
a multicast group. An implementation may maintain this data as
part of its internal route table, e.g. a bit-mask which corresponds
to the output ports to which this packet should be forwarded.

• Switches or routers replicate the packet (implementation dependent)
and forward the packet onto the output port(s).

(Figure annotations: Switch decodes inbound packet header (LRH) DLID to
determine target output ports. Packet is replicated and forwarded to
each output port. Router decodes inbound packet header (GRH) IPv6
multicast address to determine target output ports. Packet is
replicated and forwarded to each output port.)

Figure 21 Example Unreliable Multicast Operation
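
The bit-mask route table mentioned in the example can be sketched as follows. This is only an illustration under stated assumptions: the table layout, the function name multicast_output_ports, and the sample DLID value are hypothetical and are not defined by IBA.

```python
from typing import Dict, List

# Hypothetical multicast route table: multicast DLID -> bit-mask of output ports.
# Bit i set means "forward a copy out of port i".
MulticastTable = Dict[int, int]


def multicast_output_ports(table: MulticastTable, dlid: int, arrival_port: int) -> List[int]:
    """Return the ports that should receive a copy of the packet."""
    mask = table.get(dlid, 0)        # unknown DLID: no group entry, so no copies
    mask &= ~(1 << arrival_port)     # never send the packet back out the arrival port
    ports = []
    port = 0
    while mask:
        if mask & 1:
            ports.append(port)
        mask >>= 1
        port += 1
    return ports


# Illustrative group reaching ports 1, 2 and 4; the packet arrives on port 2.
table = {0xC001: 0b10110}
print(multicast_output_ports(table, 0xC001, arrival_port=2))
```

Clearing the arrival-port bit before walking the mask mirrors the rule that a switch sends the packet to each designated port except the arrival port.
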
3.5.11.2 Group Management 

IBA V 1.0 does not define the multicast group management protocol to
be used to implement join and leave operations. However, the management
interface and associated MADs to implement a multicast group protocol are
specified. While these mechanisms are part of the Subnet Administration
(SA), some actions are implicitly performed by the Subnet Manager (SM).
For the following discussion, the term multicast management entity is
used to describe the SA/SM expected responsibility with respect to
multicast management. Refer to the Subnet Administration attributes of
Multicast Group Record and Multicast Member Record for more information.

3.5.11.2.1 Multicast Group Create 

The multicast group creation is an explicit operation in IBA, in order to pro- 
vide a single control of group characteristics and allow members to join 
subversively. The group has to be created by the multicast management 
entity before a join can be successful: 

1) An (administrative) application defines (or determines) a target multi- 
cast group address (GID). It specifies particular group characteristics 






InfiniBand™ Trade Association    Page 81

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067

InfiniBand™ Architecture Release 1.0    Architectural Overview    October 24, 2000
Volume 1 - General Specifications    FINAL

(PMTU, raw, P-Key, etc.) and creates the multicast group by invoking
a multicast group create to the multicast management entity. This
application may request a specific multicast LID or have one allocated
for it.

2) The multicast management entity may notify appropriate routers on
the subnet of the new group which is being created (not defined in
IBA V 1.0). The router protocol should determine whether this
multicast group is in operation within another subnet. If so the router
returns the PMTU of the existing multicast group to determine whether
the create is allowed or not.



3) The multicast management entity maps the multicast group address 
to an unused multicast LID or to the requested multicast LID. 



3.5.11.2.2 Multicast Group Join 




The multicast group join algorithm (applies to IBA and raw multicast
groups) is defined as follows:

1) Application defines or determines the target multicast group address
and invokes a multicast join operation.

2) The underlying join implementation determines if the associated
endnode is participating in the multicast group. If it is, the application
establishes a new local QP and performs the steps required to join
this group. If not, the application invokes the management interface
to communicate with the multicast group management entity.

3) The multicast management entity performs the following steps upon
receiving a join request:

a) Validate the multicast group address - fail join operation if invalid.

b) Validate the requested PMTU - fail join operation if invalid.

c) Verify the switch attached to the endnode is capable of multicast
operation. The switch either supports multicast operation via
packet replication or it can be configured to send all multicast
packets to the endnode-attached port.

d) If the multicast group address is currently in operation within this
subnet, take the following actions:

i) Verify all switches and routers which are participating in this
multicast group can support the requested PMTU. If they cannot,
the join operation fails.

ii) Each multicast group is implemented by defining a logical
routing tree across the participating switches. Rebuild / modify
the routing tree to include the new endnode. The multicast
management entity informs fabric management to update the
associated route forwarding tables within all switches and
routers to reflect this new topology.

e) If the multicast group address is not operating within this subnet,
take the following steps.

i) Inform each router within this subnet of the join operation. The
router protocol should determine whether this multicast group
is in operation within another subnet. If so, the router returns
the PMTU of the existing multicast group to determine whether
the create and subsequent join operation is allowed or not.

ii) Map the multicast group address to an unused multicast LID.

iii) Establish a multicast routing tree and update the associated
switch and router route forwarding tables accordingly.

iv) Create the group and assign the PMTU to the multicast
group.

v) Return the multicast LID and associated group characteristics
to the endnode and allow multicast operations to be initiated.

f) Each router within this subnet is informed of successful multicast
join operation. Routers invoke the appropriate multicast group
management operations to add this subnet as participating in the
associated multicast group. This protocol is outside the IBA
specification.

4) Add the member to the group.
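
The bookkeeping side of the join steps above can be sketched roughly as below. Everything here is a simplification under stated assumptions: the MulticastManager class, its method names, and the reduction of step d) to "the requested PMTU must equal the group's PMTU" are hypothetical, and router notification and routing-tree updates are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Group:
    lid: int                                   # multicast LID mapped to the group
    pmtu: int                                  # PMTU assigned when the group was created
    members: Set[str] = field(default_factory=set)


class MulticastManager:
    """Toy multicast management entity for one subnet (hypothetical API)."""

    def __init__(self, first_lid: int, last_lid: int) -> None:
        self._free_lids = list(range(first_lid, last_lid + 1))
        self.groups: Dict[str, Group] = {}     # multicast GID -> group state

    def join(self, gid: str, endnode: str, pmtu: int, switch_ok: bool = True) -> Optional[int]:
        """Return the multicast LID on success, or None if the join fails."""
        # Steps a) to c): placeholder validation of address, PMTU and switch capability.
        if not gid or pmtu <= 0 or not switch_ok:
            return None
        group = self.groups.get(gid)
        if group is None:
            # Step e): group not yet operating in this subnet, map it to an unused LID.
            if not self._free_lids:
                return None
            group = Group(lid=self._free_lids.pop(0), pmtu=pmtu)
            self.groups[gid] = group
        elif pmtu != group.pmtu:
            # Step d), simplified: the existing group cannot support the requested PMTU.
            return None
        group.members.add(endnode)             # Step 4): add the member
        return group.lid


manager = MulticastManager(first_lid=0xC000, last_lid=0xC0FF)
print(hex(manager.join("ff12::1", "node-a", pmtu=2048)))
```

A second endnode joining the same GID with the same PMTU receives the same LID, while a mismatched PMTU makes the join fail.
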

3.5.11.2.3 Multicast Group Leave

When an application leaves a multicast group, the following algorithm is
used:

1) The application's QP is removed as a target for the multicast group. If
there are QPs still participating in this multicast group, no further
action is required.

2) If there are no more QPs on this port participating within the multicast
group, the leave implementation informs the multicast management
entity that this endnode is no longer participating in this multicast
group. The multicast management entity takes the following steps:

a) Update the switch and router route forwarding table(s) to effectively
remove this endnode as a target for packets associated
with this multicast group.

b) Remove the member from the group.

3.5.11.2.4 Multicast Group Delete

When an (administrative) application deems there is no need for a
multicast group or there are no other endnodes participating in a multicast
group, the multicast group may be deleted. Upon receiving the delete
request, the multicast management entity takes the following steps:

1) Unmap the multicast LID from the multicast group address.

2) Inform each router within this subnet that this subnet is no longer
participating in the associated multicast group.

3.5.11.3 Multicast Prune



To improve fabric efficiency, the multicast group management entity 
should periodically verify that all endnodes and routers participating within 
a multicast group are still participating and if they are not, it should prune 
them from the multicast group by performing the multicast group leave
algorithm. The verification period is outside the scope of IBA V 1.0.



3.5.12 Verbs 



IBA describes the service interface between a host channel adapter and 
the operating system by a set of semantics called Verbs. Verbs describe 
operations that take place between a host channel adapter and its oper- 
ating system based on a particular queuing model for submitting work re- 
quests to the channel adapter and returning completion status. 

The intent of Verbs is not to specify an API, but rather to describe the
interface sufficiently to permit OS vendors to define appropriate APIs that
take advantage of the architecture.

Verbs describe the parameters necessary for configuring and managing
the channel adapter, allocating (creating and destroying) queue pairs,
configuring QP operation, posting work requests to the QP, and getting
completion status from the completion queue.

3.6 Channel & Memory Semantics

IBA communications provide the user with both channel semantics and 
memory semantics since both are useful for I/O and IPC. Channel seman- 
tics, sometimes called Send/Receive, refers to the communication style 
used in a classic I/O channel - one party pushes the data and the desti- 
nation party determines the final destination of the data. The message 
transmitted on the wire only names the destination's QP, the message 
does not describe where in the destination consumer's memory space the 
message content will be written. 

With memory semantics the initiating party directly reads or writes the vir- 
tual address space of a remote node. The remote party needs only com- 
municate the location of the buffer; it is not involved with the actual 
transfer of the data. 

A typical I/O transaction might use a combination of channel and memory
semantics. For example, a host process might initiate an I/O operation by
using channel semantics to send a disk write command to an I/O device.
The I/O device examines the command and uses memory semantics to
read the data buffer directly from the memory space of the processor
node. After the operation is completed, the I/O unit then uses channel
semantics to push an I/O completion message back to the processor node.

3.6.1 Communication Interface

"Channel adapter" is the term that identifies the hardware that connects a
node to the IBA fabric (and includes any supporting software). The
channel adapter for a processor node is called a "host channel adapter"
(HCA) and a channel adapter in an I/O node is a "target channel adapter"
(TCA). A consumer communicates through one or more "queue pairs"
(QP). An HCA typically supports hundreds or thousands of QPs while a
TCA might support fewer than ten QPs.

It is the QP that is the communication interface. The user initiates work
requests (WR) that cause work items, called WQEs, to be placed onto the
queues and the channel adapter executes the work item.

Specifically, the operations supported for Send Queues are:

• Send Buffer - a channel semantic operation to push a local buffer
to a remote QP's receive buffer. The Send WR includes a gather
list to combine data from several virtually contiguous local
buffer segments into a single message that is pushed to a remote
QP's Receive Buffer. The local buffer's virtual addresses must be
in the address space of the consumer that created the local QP.

• RDMA Read - a memory semantic operation to read a virtually
contiguous buffer on a remote node. The RDMA Read operation
reads a virtually contiguous buffer on a remote endnode and
writes the data to a local memory buffer.

Like the Send operation, the local buffer must be in the address
space of the consumer that created the local QP.

The remote buffer must be in the address space of the remote
consumer owning the remote QP targeted by the RDMA Read.

• RDMA Write - a memory semantic operation to write a virtually
contiguous buffer on a remote node. The WR contains a gather
list of local buffer segments and the virtual address of the remote
buffer into which the data from the local buffer segments are
written.

Like the Send WR, the local buffer must be in the address space
of the consumer that created the local QP.

The remote buffer must be in the address space of the remote
consumer owning the remote QP targeted by the RDMA Write.

• Atomic - a memory semantic operation to do an atomic operation
on a remote 64 bit word. The Atomic operation is a combined
Read, Modify, and Write operation.

An example of an Atomic operation is the Compare and Swap if
Equal operation. The WR specifies a remote memory location, a
compare value, and a new value. The remote QP reads the specified
memory location, compares that value to the compare value
supplied in the message, and only if those values are equal, then
the QP writes the new value to that same memory location. In either
case the remote QP returns the value it read from the memory
location to the requesting QP. The other atomic operation is the
FetchAdd operation where the remote QP reads the specified
memory location, returns that value to the requesting QP, adds to
that value a value supplied in the message, and then writes the
result to that same memory location.

• Memory Bind - a memory management operation that changes
the binding of a memory window. The Bind Memory Window
operation associates a previously allocated Memory Window to a
specified address range within an existing Memory Region, along
with a specified set of remote access privileges.
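
The two Atomic operations described in the list above can be modeled with a few lines of Python. This is a sketch of the semantics only, not of any channel adapter interface: the dictionary standing in for remote memory and the function names are hypothetical, and wrapping the FetchAdd result to 64 bits is an assumption made here for illustration.

```python
from typing import Dict

MASK64 = (1 << 64) - 1   # the operations act on a 64-bit word


def compare_and_swap(memory: Dict[int, int], addr: int, compare: int, new: int) -> int:
    """Write `new` only if the stored value equals `compare`; always return the value read."""
    original = memory[addr]
    if original == compare:
        memory[addr] = new & MASK64
    return original


def fetch_add(memory: Dict[int, int], addr: int, addend: int) -> int:
    """Return the stored value, then store value + addend (wrapped to 64 bits, an assumption)."""
    original = memory[addr]
    memory[addr] = (original + addend) & MASK64
    return original


remote = {0x1000: 5}                            # toy stand-in for a remote 64-bit word
print(compare_and_swap(remote, 0x1000, 5, 9))   # values match, so 9 is written
print(fetch_add(remote, 0x1000, 3))             # returns the value before the add
```

In both cases the requester receives the value that was in memory before the operation, which is what lets it tell whether a Compare and Swap took effect.
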

For Receive Queues, there is only a single type of WR:

• Post Receive Buffer - a channel semantic operation describing
a local buffer into which incoming Send messages are written.
The WR includes a scatter list describing several local buffer
segments. The contents of an incoming Send message are written to
these buffer segments in the order specified. The buffer's virtual
addresses must be in the address space of the consumer that
created the local QP. A WR without a scatter-gather list may be
used to receive Immediate Data from a Write or a zero length
Send operation.
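
The scatter behavior of a posted receive buffer can be illustrated as follows. This is a minimal sketch: the scatter function and the use of bytearray objects as buffer segments are hypothetical, and raising an error when the message does not fit is an assumption rather than the behavior IBA specifies.

```python
from typing import List


def scatter(message: bytes, segments: List[bytearray]) -> int:
    """Copy `message` into `segments` in list order; return the number of bytes written."""
    capacity = sum(len(seg) for seg in segments)
    if len(message) > capacity:
        # Assumed error handling, for illustration only.
        raise ValueError("message does not fit in the posted receive buffer")
    offset = 0
    for seg in segments:
        chunk = message[offset:offset + len(seg)]
        seg[:len(chunk)] = chunk            # fill this segment from its start
        offset += len(chunk)
        if offset == len(message):
            break
    return offset


segments = [bytearray(4), bytearray(8)]     # scatter list with two local segments
written = scatter(b"hello!", segments)
print(written, bytes(segments[0]), bytes(segments[1][:2]))
```
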

Zero processor copy data transfer, with no kernel involvement, is key in
providing high bandwidth, low latency communication. A consumer can
transfer a data buffer via the QP directly from where the buffer resides in
memory. Furthermore, the protection provided by R_Keys & L_Keys
(memory registration) removes the need for the OS to validate data
transfers. Thus the OS may allow posting the WQE from user-mode, bypassing
the operating system, and thus consuming fewer instruction cycles.

IBA operations support the use of virtual addresses and existing virtual
memory protection mechanisms to assure correct and proper access to
all memory. Thus IBA applications are not required to use physical
addressing for any operation.

A consumer can use either of two mechanisms to enable remote access
to its memory (RDMA). First, when registering its memory (creating a
Memory Region), the consumer can simply enable remote access for the
entire Memory Region. If more control of remote access is desired, the
consumer can allocate a Memory Window and bind it to a subset of an
existing Memory Region. Either approach results in an R_Key. The
consumer then provides that R_Key and the virtual address of the data buffer
to a remote node for use in subsequent RDMA operations. Only an
incoming RDMA request with a correct R_Key can gain access to the
specific area of memory. Furthermore, the QP and the Memory Region or
Window must be in the same protection domain.
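
The access rule in the paragraph above (correct R_Key, address inside the registered area, matching protection domain) can be sketched as a single check. The RemoteAccessGrant type, its field names, and the rdma_allowed function are hypothetical illustrations, not part of the Verbs interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteAccessGrant:
    """A Memory Region or Memory Window opened for remote access (illustrative)."""
    r_key: int
    start: int               # first virtual address covered
    length: int              # number of bytes covered
    protection_domain: int


def rdma_allowed(grant: RemoteAccessGrant, qp_pd: int, r_key: int, addr: int, nbytes: int) -> bool:
    """Return True only if the incoming RDMA request may touch [addr, addr + nbytes)."""
    if r_key != grant.r_key:
        return False                         # wrong R_Key: no access
    if qp_pd != grant.protection_domain:
        return False                         # QP and region/window must share a protection domain
    return grant.start <= addr and addr + nbytes <= grant.start + grant.length


window = RemoteAccessGrant(r_key=0xABCD, start=0x2000, length=0x100, protection_domain=7)
print(rdma_allowed(window, qp_pd=7, r_key=0xABCD, addr=0x2010, nbytes=64))
```
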

3.6.2 IBA Transport Services 

The IBA transport mechanisms provide multiple classes of communica- 
tion services. When a QP is created, it is configured to provide one of 
these classes of transport services: 

• Reliable Connection (acknowledged - connection oriented)

• Reliable Datagram (acknowledged - multiplexed)

• Unreliable Connection (unacknowledged - connection oriented)

• Unreliable Datagram (unacknowledged - connectionless)

• Raw Datagram (unacknowledged - connectionless)

The Reliable Connection service associates a local QP with one and 
only one remote QP. Thus a Send Buffer WQE placed on one QP causes 
data to be written into the Receive Buffer of the associated QP. RDMA op- 
erations operate on the address space of the associated QP. 

A connected service requires each consumer to create a QP for each
consumer with which it wishes to communicate. Thus if there are M
consumers on each of N platforms that all wish to communicate via connected
class of service, then each platform requires M^2 * N QPs. This assumes
that each consumer on a particular platform communicates with
consumers (including itself) on that same platform by taking advantage of the
ability to connect a QP to a QP on the same node.

Figure 22 Connected Service
The Reliable Connection is reliable because the channel adapter can
maintain sequence numbers and acknowledges all messages. A combination
of hardware and channel adapter software retries any failed
communications. The consumer of the QP sees reliable communications even
in the presence of bit errors, receive buffer underruns, network
congestion, and, if alternate paths exist in the fabric, failures of fabric
switches or links.

The acknowledgments ensure data is delivered reliably between the
associated QPs and thus between each node's memory.

The acknowledgment is not a consumer level acknowledgment - it
doesn't validate that the receiving consumer has consumed the data. The
acknowledgment only means the data has reached the destination.

The Unreliable Connection service also associates a local QP with one
and only one remote QP. Thus a Send Buffer request placed on one QP
causes data to be written into the Receive Buffer of the associated QP.
RDMA Write operations operate on the address space of the associated
QP.

Unlike reliable connection service, unreliable connection does not
acknowledge and thus does not have the ability to resend lost or corrupted
messages. Rather, lost or corrupted messages are simply dropped. Since
there is no acknowledgment, RDMA Reads and Atomic operations are not
supported. Because packets of an RDMA Write might be lost or corrupted,
partial writing of a buffer might take place, but once a missing or corrupted
packet is received, the write operation ceases until the start of a new
message.

The Unreliable Datagram service is connectionless and unacknowledged.
It allows the consumer of the QP to communicate with any unreliable
datagram QP on any node. Receive operation allows incoming
messages from any unreliable datagram QP on any node.

The Unreliable Datagram service greatly improves the scalability of IBA.
Since the service is connectionless, an endnode with a fixed number of
QPs can communicate with far more consumers and platforms compared
to the number possible using the Reliable Connection and Unreliable
Connection service.
Figure 23 Datagram Service

This is the class of service used by management to discover and integrate 
new IBA switches and endnodes. It does not provide the reliability guar- 
antees of the other service classes, but operates with less state main- 
tained at each endnode. Unlike other services where messages are 
conveyed by multiple packets, the maximum message length is con- 
strained in size to fit in a single packet. 

The Reliable Datagram (RD) service is multiplexed over connections be- 
tween nodes called End-to-end contexts (EEC) which allows each RD QP 
to communicate with any RD QP on any node with an established EEC. 
Multiple QPs can use the same EEC and a single QP can use multiple 
EECs (one EEC for each remote node per reliable datagram domain). 

Reliable datagram domains (RDDs) determine which sets of RD QPs can
access which sets of EECs. Some possible reasons for multiple RDDs are
traffic in different partitions, different QoS characteristics, security, and
performance.

The EEC uses sequence numbers and acknowledgments associated with 
each message to ensure the same degree of reliability as with the Reli- 
able Connection service. Each channel adapter maintains end-to-end 
specific state for each node to keep track of the sequence numbers, ac- 
knowledgments, and time-out values. Each EEC is shared by all Reliable 
Datagram QPs for that RDD. 

For reliable datagram service on a per RDD basis, each consumer needs
only to create a single QP and the node creates an EE context for each
platform with which it communicates. Thus if there are M consumers on
each of N platforms that all wish to communicate via IBA reliable datagram
service, then each platform requires M QPs and N end-to-end contexts.
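
To make the scaling difference concrete, the per-platform counts stated for connected service (M^2 * N QPs) and for reliable datagram service (M QPs and N end-to-end contexts) can be evaluated for sample values. The function names are hypothetical and the numbers below are illustrative only.

```python
from typing import Tuple


def connected_qps_per_platform(m: int, n: int) -> int:
    """Each of the m local consumers needs one QP per consumer on every platform: m * (m * n)."""
    return m * m * n


def reliable_datagram_resources(m: int, n: int) -> Tuple[int, int]:
    """Return (QPs, end-to-end contexts) needed per platform for reliable datagram service."""
    return m, n


m, n = 4, 16   # 4 consumers on each of 16 platforms
print(connected_qps_per_platform(m, n), reliable_datagram_resources(m, n))
```

With these sample values a platform needs 256 QPs for connected service, against 4 QPs plus 16 EE contexts for reliable datagram service.
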




Figure 24 Reliable Datagram Service 



The Raw Datagram service is not technically a transport but rather it is a 
data link service that allows a QP to send and receive raw datagram mes- 
sages. There are two types of raw datagram service (EtherType and 
IPv6). The EtherType raw datagram packet contains a generic transport 
header that is not interpreted by the channel adapter, but it specifies the 
protocol type. The IPv6 raw datagram contains a global route header that 
identifies the protocol type. 

Using IPv6 raw datagram service, the IBA channel adapter can support 
standard protocols layered atop IPv6, such as TCP and UDP. Thus native 
IPv6 packets can be bridged into the IBA SAN and delivered directly to a 
port and to its IPv6 raw datagram QP. This allows the raw datagram QP 
consumer to support multiple transport protocols. 

Using EtherType raw datagram service, the IBA channel adapter can
support standard protocols the same as Ethernet, including TCP and UDP as
well as IPv4. Thus native Ethernet packets can be bridged into the IBA
subnet and delivered directly to a port and to its EtherType raw datagram
QP.

When the QP is created, the consumer registers with the channel adapter 
in order to direct received datagrams to it (one QP for IPv6 and one for 
EtherType). 
3.7 IBA Layered Architecture

IBA operation can be described as a series of layers. The protocol of each
layer is independent of the other layers. Each layer is dependent on the
service of the layer below it and provides service to the layer above it.

Figure 25 IBA Architecture Layers

3.7.1 Physical Layer

The physical layer specifies how bits are placed on the wire to form
symbols and defines the symbols used for framing (i.e., start of packet & end
of packet), data symbols, and fill between packets (Idles). It specifies the
signaling protocol as to what constitutes a validly formed packet (i.e.,
symbol encoding, proper alignment of framing symbols, no invalid or
non-data symbols between start and end delimiters, no disparity errors,
synchronization method, etc.).

Figure 26 IBA Packet Framing

The physical layer specification is in Volume 2. It specifies the bit rates,
media, connectors, signaling techniques, etc.

3.7.2 Link Layer



The link layer describes the packet format and protocols for packet oper- 
ation, e.g. flow control and how packets are routed within a subnet be- 
tween the source and destination. There are two types of packets. 

• Link Management Packet - these are packets used to train and
maintain link operation. These packets are created and con- 
sumed within the Link Layer and are not subject to flow control. 
Link management packets are used to negotiate operational pa- 
rameters between the ports at each end of the link such as bit 
rate, link width, etc. They are also used to convey flow control 
credits and maintain link integrity. Link management packets are 
never forwarded to other links. 

• Data Packet - these are the packets that convey IBA operations 
and they consist of a number of different headers, which might or 
might not be present. 
Figure 27 IBA Data Packet Format



The Local Route Header (LRH) is always present and it identifies the 
local source and local destination ports where switches will route the 
packet and also specifies the Service Level (SL) and VL on which the 
packet travels. The VL is changed as the packet traverses the subnet but 
the other fields remain unchanged. 
The subnet manager assigns unique LIDs to each port of a channel
adapter as well as the management entity of a switch. The source places
the LID of the destination in the LRH and switches route the packet to that
destination. If the packet is to be routed to another subnet, the packet's
destination LID contains the LID of a router, otherwise the packet's
destination LID specifies a LID assigned to a channel adapter (or switch, for
certain management packets).

There are two CRCs in each packet. The Invariant CRC (ICRC) covers
all fields which should not change as the packet traverses the fabric. The
Variant CRC (VCRC) covers all of the fields of the packet. The combination
of the two CRCs allows switches and routers to modify appropriate
fields and still maintain end to end data integrity for the transport
control and data portion of the packet. The coverage of the ICRC is different
depending on whether the packet is routed to another subnet (i.e.
contains a global route header).

Link level flow control is a credit based method where the receiver on
each link sends credits to the transmitter on the other end of the link.
Credits are per VL and indicate the number of data packets that the
receiver can accept on that VL. The transmitter does not send data packets
unless the receiver indicates it has room. VL15 (the management VL) is
not subject to flow control.

3.7.3 Network Layer

The network layer describes the protocol for routing a packet between
subnets.

The Global Route Header (GRH) is present in a packet that traverses 
multiple subnets. The GRH identifies the source and destination ports 
using GID in the format of an IPv6 address. Routers forward the packet 
based on the content of the GRH. As the packet traverses different sub- 
nets, the routers modify the content of the GRH and replace the LRH. But 
the source and destination GIDs do not change and are protected by the 
ICRC field. Routers recalculate the VCRC but not the ICRC. This pre- 
serves end to end transport integrity. 

Each subnet has a unique subnet ID, the Subnet Prefix. When combined
with a Port GUID, this combination becomes a port's Global ID (GID). A
node might have other locally administrated Global IDs. The source
places the GID of the destination in the GRH and the LID of the router in
the LRH. Each router forwards the packet through the next subnet to
another router until the packet reaches the target subnet. The last router
replaces the LRH using the LID of the destination.
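
A port's GID can be illustrated by combining a subnet prefix with a Port GUID. The sketch below assumes a 64-bit prefix in the upper half and a 64-bit GUID in the lower half of the 128-bit, IPv6-format value; that layout, the function name default_gid, and the sample values are assumptions for illustration, since the text above only says the two are combined.

```python
import ipaddress


def default_gid(subnet_prefix: int, port_guid: int) -> ipaddress.IPv6Address:
    """Combine a 64-bit subnet prefix and a 64-bit Port GUID into an IPv6-format GID."""
    limit = 1 << 64
    if not (0 <= subnet_prefix < limit and 0 <= port_guid < limit):
        raise ValueError("subnet prefix and Port GUID must each fit in 64 bits")
    # Assumed layout: prefix in the upper 64 bits, GUID in the lower 64 bits.
    return ipaddress.IPv6Address((subnet_prefix << 64) | port_guid)


# Illustrative values only.
print(default_gid(0xFE80000000000000, 0x0002C90300001234))
```
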
3.7.4 Transport Layer



The network and link protocols deliver a packet to the desired destination. 
The transport portion of the packet delivers the packet to the proper QP 
and instructs the QP how to process the packet's data. 

The transport layer is responsible for segmenting an operation into mul- 
tiple packets when the message's data payload is greater than the max- 
imum transfer unit (MTU) of the path. The QP on the receiving end 
reassembles the data into the specified data buffer in its memory. 
Figure 28 Segmentation of Data

The Base Transport Header (BTH) is present in all packets except for
RAW datagrams. It specifies the destination QP and indicates the
operation code, packet sequence number, and partition.

The operation code identifies if the packet is the first, last, intermediate, or 
only packet of a message and specifies the operation (Send, RDMA Write, 
Read, Atomic). 
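
Segmentation and the first / intermediate / last / only marking can be sketched as below. This is a simplified illustration, not the BTH encoding: the function names and the string labels are hypothetical, and real opcodes also carry the operation type.

```python
from typing import List, Tuple


def segment_message(payload: bytes, mtu: int) -> List[Tuple[str, bytes]]:
    """Split a message payload into MTU-sized packets tagged by position."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    # An empty message is still carried by one (empty) packet in this sketch.
    chunks = [payload[i:i + mtu] for i in range(0, len(payload), mtu)] or [b""]
    if len(chunks) == 1:
        return [("ONLY", chunks[0])]
    labels = ["FIRST"] + ["INTERMEDIATE"] * (len(chunks) - 2) + ["LAST"]
    return list(zip(labels, chunks))


def reassemble(packets: List[Tuple[str, bytes]]) -> bytes:
    """Receiving side: concatenate packet payloads in arrival order."""
    return b"".join(data for _, data in packets)
```

A payload that fits within one MTU yields a single "ONLY" packet; anything larger yields one "FIRST", zero or more "INTERMEDIATE", and one "LAST" packet.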

The packet sequence number (PSN) is initialized as part of the communi- 
cations establishment process and increments each time the QP creates 
a new packet. The receiving QP tracks the received PSN to determine if 
it lost a packet. For reliable service, the receiver sends an ACK or NAK 
packet back to notify the sender that packets were or were not received 
correctly. In this case the recipient discards subsequent packets until the 
sender resends the missing messages. For unacknowledged service,
when the recipient detects a missing packet, it aborts the current operation
and discards all subsequent packets until it receives one that specifies
a first or only operation code. Then operation continues.
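
The reliable-service behavior above (track the expected PSN, NAK on a gap, discard until the missing packet is resent) can be sketched as follows. The class and method names are hypothetical, and the sketch deliberately ignores PSN wraparound, duplicate packets, and ACK coalescing, none of which are described in this section.

```python
from typing import Tuple


class ReliableReceiver:
    """Toy receiver that accepts packets only in PSN order."""

    def __init__(self, initial_psn: int) -> None:
        # The PSN is initialized during communication establishment.
        self.expected_psn = initial_psn

    def on_packet(self, psn: int) -> Tuple[str, int]:
        """Return ("ACK", psn) if accepted, else ("NAK", next expected PSN)."""
        if psn == self.expected_psn:
            self.expected_psn += 1
            return ("ACK", psn)
        # Out-of-sequence packet: discard it and report what is expected.
        return ("NAK", self.expected_psn)
```

In this sketch every out-of-order packet is answered with a NAK naming the expected PSN; a real implementation would be more selective about when it responds.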


For reliable datagram service, the ETH identifies the EE context that the
QP uses to detect missing packets.

The first message of an RDMA Read or Write operation contains an
RDMA ETH that specifies the virtual address, R_Key, and total length of
the data buffer to read or write. Subsequent RDMA Write packets provide
the remainder of the data. The QP validates that the memory is properly
registered for access by that QP and that the total data written does not
overrun the length specified. For an RDMA Read operation, the QP
fetches the data, segments it into Read Response packets, and sends
them to the originator. When receiving an RDMA Read response, the QP writes
the data into the buffer specified in the WQE of the RDMA Read Request.

An Atomic operation contains an Atomic ETH that specifies the virtual address
and R_Key of the memory location that is the object of the operation
as well as 2 operands. The QP validates that the memory is properly registered
for access by that QP. The QP fetches the data, returns that value
to the originator, performs the operation, and writes the result back to
memory. For the Compare & Swap operation, the QP compares the content
of the memory location with the first operand, and if they match, then
it writes the second operand to that same location. Otherwise it does not
modify it. For the Fetch & Add operation, the QP performs an unsigned
add using the 64-bit Add Data field in the Atomic ETH, and writes the result
back to the same memory location. In either case, operation is atomic
such that another QP is not allowed to modify that memory location between the
time of the read and the subsequent write.
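
The semantics of the two atomic operations can be sketched in software as below. This is a model of the behavior only, not of the hardware: the class `AtomicMemory` and its dictionary-backed storage are hypothetical, a lock stands in for the atomicity guarantee, and wraparound modulo 2^64 on overflow is an assumption.

```python
import threading

MASK64 = (1 << 64) - 1  # values are treated as unsigned 64-bit quantities


class AtomicMemory:
    """Toy memory whose read-modify-write operations are serialized by a lock."""

    def __init__(self) -> None:
        self._cells = {}
        self._lock = threading.Lock()

    def read(self, addr: int) -> int:
        with self._lock:
            return self._cells.get(addr, 0)

    def compare_and_swap(self, addr: int, compare: int, swap: int) -> int:
        """Write `swap` only if the cell equals `compare`; return the original value."""
        with self._lock:
            original = self._cells.get(addr, 0)
            if original == compare:
                self._cells[addr] = swap & MASK64
            return original

    def fetch_and_add(self, addr: int, add: int) -> int:
        """Add `add` to the cell (assumed to wrap at 64 bits); return the original value."""
        with self._lock:
            original = self._cells.get(addr, 0)
            self._cells[addr] = (original + add) & MASK64
            return original
```

Both operations return the value fetched before modification, matching the description that the QP returns the original data to the originator.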






The Immediate Data (IMMDT) field is optionally present in RDMA WRITE
and SEND messages. It contains data that the consumer placed in the
Send or RDMA Write request and the receiving QP will place that value in
the current receive WQE. An RDMA Write with immediate data will consume
a receive WQE even though the QP did not place any data into the
receive buffer since the IMMDT is placed in a CQE that references the receive
WQE and indicates that the WQE has completed.

For reliable connection service, IBA defines an end-to-end message level
flow control. This allows the receiver to send credits to the transmitter as
WQEs are posted to the receive queue. The QP tracks the number of
WQEs posted and retired from the receive queue and keeps track of the
number of messages received. It adds these numbers together to achieve
a message limit value which it sends to the transmitter on the other end of
the connection. The transmitter keeps track of the total number of messages
that it creates and stops transmitting when it reaches the limit value
established by the other end of the connection.
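
One way to read the message limit calculation is sketched below. It rests on an interpretation, not on wording in this section: the limit is taken as the number of messages received plus the number of receive WQEs currently outstanding (posted minus retired), and each arriving message is assumed to retire exactly one WQE. All class and method names are hypothetical.

```python
class ReceiverCredits:
    """Receiver-side bookkeeping for end-to-end message level flow control."""

    def __init__(self) -> None:
        self.posted = 0
        self.retired = 0
        self.messages_received = 0

    def post_receive_wqe(self) -> None:
        self.posted += 1

    def message_arrived(self) -> None:
        # Assumption: each received message consumes and retires one receive WQE.
        self.messages_received += 1
        self.retired += 1

    def message_limit(self) -> int:
        # Assumed interpretation: messages received plus outstanding WQEs.
        return self.messages_received + (self.posted - self.retired)


class Transmitter:
    """Sender that stops once it has created as many messages as the limit."""

    def __init__(self) -> None:
        self.messages_created = 0
        self.limit = 0

    def update_limit(self, limit: int) -> None:
        self.limit = limit

    def try_send(self) -> bool:
        if self.messages_created >= self.limit:
            return False
        self.messages_created += 1
        return True
```

Because the limit is cumulative, the receiver never has to tell the transmitter how many credits were consumed; posting a new receive WQE simply raises the limit by one.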



3.7.5 Upper Layer Protocols

3.7.5.1 Subnet Management



As illustrated in Figure 29, IBA supports any number of upper layer proto- 
cols by various user consumers. IBA also defines messages and proto- 
cols for certain management functions. These management protocols are 
separated into Subnet Management and Subnet Services. Both of these 
have unique properties.

Figure 29 Upper Layers



Subnet Management is actually divided between the Subnet Manager
(SM) application and the Subnet Management Agent (SMA). There only
needs to be one subnet manager per subnet and it can reside in any node
including switches and routers. Subnet management uses a special class
of Management Datagram (MAD) called a Subnet Management Packet
(SMP) which is directed to a special queue pair (QP0). As illustrated in
Figure 30, each port has a QP0, and each node contains an SMA that:

• processes Get() and Set() SMPs received on QP0

• processes Action() SMPs received on QP0

• sends GetResp() SMPs out QP0

• sends Trap() SMPs out QP0.

A subnet manager:

• sends SMPs out QP0 to any port's QP0

• processes all SMPs received on QP0 except Action(), Get(), and
Set() SMPs, which are processed by that node's SMA.

Figure 30 Subnet Management Elements (labels: Subnet Manager Application (optional), Message & Data Service, Channel Adapter)

3.7.5.2 General Services


General Service Agents (GSA*) actually consist of a number of management
service agents as illustrated in Figure 31. Some of the services are
optional. General services use a message format called a General Management
Packet (GMP) which is a Management Datagram (MAD) and is
normally directed to a special queue pair (QP1) called the General Service
Interface (GSI). As illustrated in Figure 31, each port has a QP1, and
all GMPs received on QP1 are processed by one of the GSAs. The
GSA is actually able to redirect GMPs for its particular class of service to
another queue pair, allowing each GSA to maintain its own communication
interface.

Figure 31 General Services

3.8 IBA Transaction Flow



A consumer interacts with an IBA channel adapter through a data struc- 
ture called the Queue Pair, consisting of a Send Queue and a Receive 
Queue. A message is initiated by posting a work request which results in 
a WQE being placed on the Send Queue. 

The channel adapter detects the WQE posting and accesses the WQE. 
The channel adapter interprets the command, validates the WQE's virtual
addresses, translates them to physical addresses, and accesses the data.
The outgoing message buffer is split into one or more packets. To each 
packet the channel adapter adds a transport header (sequence numbers, 
opcode, etc.). If the destination resides on a remote subnet the channel 
adapter adds a network header (source & destination GIDs). The channel 
adapter then adds the local route header and calculates both the variant 
and invariant checksums. 

The flow of packets is subject to the link-level protocol over each link. 

A packet is the unit of information that is routed through the IBA fabric. The 
packet is an endnode-to-endnode construct, in that it is created and con- 
sumed by endnodes. As the packet passes through switches, the switch 
may need to change the virtual lane and thus must replace the variant 
CRC with a new value but it does not touch the invariant CRC. If the 
packet passes through a router, the router changes the local route header 
and updates fields in the global route header, again updating the variant 
CRC but not changing the invariant CRC. Each switch and router moves
the packet closer to its ultimate destination.

When a packet arrives at its final destination it goes through normal validity
checks (e.g., framing violations, disparity, illegal characters, alignment,
etc.) and both VCRC and ICRC are checked for integrity. The
transport header identifies the target QP and the channel adapter uses
context from that QP to validate that the packet came from the correct
source, etc. and checks that the packet sequence number is valid (no
missed packets). For a Send operation, the QP retrieves the address of
the receive buffer from the next WQE on its receive queue, translates it to
physical addresses, and accesses memory writing the data. If this is not
the last packet of the message, the QP saves the current write location in
its context and waits for the next packet at which time it continues writing
the receive buffer until it receives a packet that indicates it is the last
packet of the operation. It then updates the receive WQE, retires it, and
sends an acknowledge message to the originator.

For reliable service types, if the QP detects one or more missing packets,
it sends a NAK message to the originator indicating its next expected sequence
number. The originator can then resend starting with the expected
packet.

When the originator receives an acknowledgment, it creates a CQE on the 
CQ and retires the WQE from the send queue. 

A QP can have multiple outstanding messages at any one time but the
target always acknowledges in the order sent, thus WQEs are retired in
the order that they are posted.

IBA management defines a common management infrastructure for

• Subnet Management - provides methods for a subnet manager to
discover and configure IBA devices and manage the fabric.

• Subnet administration - provides nodes with information gathered
by the SM and provides a registrar for nodes to register
general services they provide.

• Communication establishment & connection management between
endnodes

• Mechanisms to discover and manage I/O devices "behind"
channel adapters

• Configuration management - an authority for assigning I/O resources
to hosts

• Performance management - monitors and reports well-defined
performance counters

• Baseboard management - provides for power & chassis management
using IB-ML as defined in Volume 2

• SNMP Tunneling (SNMP) - provides a method for sending and
receiving information between management agents and management
applications. This includes Simple Network Management
Protocol (SNMP), Desktop Management Interface
(DMI), and Common Information Model (CIM), as well as other
standard and proprietary interfaces.

The subnet management physical and logical models are illustrated in 
Figure 32. The general service models are illustrated in Figure 33 and 
Figure 34.

Figure 32 Subnet Management Models

Every channel adapter, switch, and router has a Subnet Management
Agent (SMA) that responds to subnet management packets. Communication
between the SM and SMAs uses a well-known interface called the
Subnet Management Interface (SMI) where each port has a QP with QP
Number 0 (QP0) that is dedicated to sending and receiving SMPs.

Protection - The subnet manager can place a key (M_Key) in each node
which cannot be read by other nodes and prevents nodes without the
M_Key from modifying a node's configuration. The SM only shares the
M_Key with trusted peers as necessary. IBA also provides a lease expiration
mechanism such that if the SM dies before it shares M_Key information
with a successor, the lease expires, and the node returns to a state
that allows the successor SM to establish a new M_Key.

IBA management defines the underlying interfaces and principles that
allow IBA devices and the corresponding fabric to be discovered, initial- 
ized, and controlled. It defines a common management model and frame- 
work applicable to IBA-managed elements, identifies those elements, and 
defines their managed features. Management applications use this infrastructure
to manage the IBA devices and communicate with other management
applications.

Figure 34 General Services Logical Models

IBA management infrastructure supports a number of different manage- 
ment service classes and logically provides for any node to host a class 
manager. Figure 34 illustrates different ways that management classes 
can use the management infrastructure.

• The Managed Agent model allows a class manager or management
application to manage nodes through a General Service
Agent (GSA) defined for that class present on each node to be
managed. This is the same model used for subnet management
and is the model used for I/O device management, baseboard
management, SNMP, and performance management classes.

• The Peer Agents model allows managers resident on each node 
as a GSA to communicate with each other. This is the model 
used for communication management class. 

• The Managed Service model allows management applications to 
access class managers. This is the model used for subnet admin- 
istration and I/O resource management classes. 

IBA management entails a variety of concepts, including:

• A means of configuring and gathering information from endnodes,
switches and routers.

• A diagnostics framework as a common error handling mechanism.

• Installation and configuration services to allow for discovery and initialization
of the fabric and endnodes.

• A standard management packet called a "Management Datagram" or
"MAD".

• Subnet Management Packets (SMP) as a subset of the MADs to allow
set and get operations specifically between the Subnet Manager
and IBA devices.

• General Management Packets (GMP) as the remaining subset of the
MADs that allow management operations between the Subnet Manager
and IBA devices and management operations between IBA devices
themselves.

• Communication management services to allow setup and teardown
of communications between channel adapters.

• Partitioning services to configure ports of an endnode to be members
of one or more possibly overlapping sets called partitions.
IBA provides the means for the operating system to restrict access to the
management infrastructure. For the SMI, subnet management packets
must be sourced from QP0. The GSI uses a privileged Q_Key (i.e., a
Q_Key with the most significant bit set). Host channel adapters do not
permit a privileged Q_Key to be specified in a work request; rather, the QP
must be configured for privileged operation by configuring the QP context
with the privileged Q_Key. This permits management applications and
class managers to maintain their own QPs. The GSI uses QP1 for initial
communication but allows traffic for a particular class to be redirected to
a privileged QP.
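
The privileged Q_Key check can be sketched as below. The 32-bit Q_Key width is an assumption (this section says only that the most significant bit marks a privileged Q_Key), and the function names and the use of `PermissionError` are hypothetical.

```python
Q_KEY_BITS = 32  # assumed width; not stated in this section
PRIVILEGED_BIT = 1 << (Q_KEY_BITS - 1)


def is_privileged_q_key(q_key: int) -> bool:
    """Return True if the most significant bit of the Q_Key is set."""
    if not 0 <= q_key < (1 << Q_KEY_BITS):
        raise ValueError("Q_Key out of range")
    return bool(q_key & PRIVILEGED_BIT)


def check_work_request_q_key(q_key: int) -> int:
    """Model of an HCA refusing a privileged Q_Key supplied in a work request."""
    if is_privileged_q_key(q_key):
        raise PermissionError("privileged Q_Key must come from the QP context")
    return q_key
```

The point of the second function is that an unprivileged consumer cannot reach privileged QPs merely by choosing a Q_Key value in a work request.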

3.9.1 Management Interfaces

IBA defines two well known QPs for management interfaces. QP0 is reserved
for subnet management and QP1 is designated for general management
services.

3.9.2 Subnet Management Interface

Every IBA port has a QP dedicated to subnet management. This is QP0.
QP0 has special features that make it unique compared to other QPs.



• QP0 is permanently configured for Unreliable Datagram class of
service.

• Each port of an IBT device has a QP0 that sends and receives
packets.

• QP0 is a member of all partitions (i.e., can accept any packet
specifying any partition)

• Only subnet management packets (SMPs) are valid

• Traffic for QP0 (i.e., SMPs) exclusively uses VL15, which is not
subject to link-level flow control.

3.9.2.1 Fabric Initialization

The subnet manager uses this service interface to poll and configure the
fabric. Switches support a special routing mode known as directed routing
that allows SMPs to be routed through switches prior to switches being
configured with their forwarding database and prior to nodes being assigned
local IDs. The subnet manager walks its way through the fabric
sending SMPs to a device and discovering if it is a switch. Using directed
routing, it can then send SMPs out each of the switch's ports to discover
the devices connected to the switch. This process continues until the
subnet manager discovers all of the devices and how they are interconnected.

Once the SM learns the subnet's topology, it configures each node with
local IDs and configures the routing tables of switches. Once the fabric
has been configured, SMPs can be sent using destination routing.

IBA allows multiple subnet managers per subnet but only one can be the
master manager. Thus IBA defines how a subnet manager detects the
presence of another subnet manager and the arbitration mechanism for
selecting which will be the master subnet manager.

3.9.2.2 Gets & Sets

Gets and Sets are the most common use of SMPs. The SM polls the fabric
and learns its topology by sending SMP Get Request messages. Each
destination responds by sending a SMP Get Response message that includes
the requested data. The SM configures IBT devices by sending
SMP Set Request messages. This is effectively a Set & Get request. Each
destination responds by sending a SMP Get Response message that includes
the data values after the set action. Since not all parameters are
settable or they might have limits, the originator inspects the response
message to determine the true effect of the set request message.

3.9.2.3 Traps and Notices

A trap is a message sent by a management agent to its class manager
when certain asynchronous management events occur (such as protocol
violations). Notices are a list of management events that are queued at
the managed node and may be retrieved and cleared by the class manager
or management application.

IBT devices use SMPs to send traps to the subnet manager when certain
events occur. One such use is for a switch to send a trap to the subnet
manager when it detects a state change on one of its ports (i.e., a topology
change and/or device joining or leaving). Of course since SMPs are unreliable,
the SM cannot solely depend on this type of notification, but successful
traps will decrease the latency in managing the topology change.

3.9.2.4 Actions

The SM also uses SMPs to send action commands to devices. Wake is
one such action that, if supported, causes a node to power up and become
operational. Another action is the node reset command.

3.9.2.5 Directed Routes

A SMP can specify the route it takes through the fabric. This is done by
including in the SMP a list of port numbers that define a path through the
subnet (i.e., the path vector). The path vector specifies the output port for
each switch along the path. The packet contains two path vectors (one for
the forward route and one for the reverse route), a direction bit that indicates
which path vector to traverse, and a hop pointer that indicates the
current position in the path vector. The reverse path vector is built by
switches as they process the forward path vector.

When a switch receives a directed routed SMP, it uses the current hop 
pointer to identify where in the path vector it is. If the direction is "forward" 
it determines the output port from the forward path vector, updates the reverse
path vector by adding the port number on which it received the SMP,
increments the current hop pointer, and then forwards the packet out the 
specified output port. When the packet reaches the destination, the target 
device uses the reverse route field for the reply by simply changing the 
sense of the direction bit and sending the reply SMP out the port on which 
it was received. Because the direction is "reverse" each switch now dec- 
rements the current hop pointer, uses it to determine the original input 
port, and then sends the packet out that port. 
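
The forward and reverse handling of a directed route SMP can be sketched as follows. This is a simplified model: the names `DirectedRouteSMP`, `switch_process`, and `target_reply` are hypothetical, the path vectors are plain Python lists, and the zero-based hop pointer is an assumption rather than the packet's actual encoding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DirectedRouteSMP:
    forward_path: List[int]                                  # output port per switch
    reverse_path: List[int] = field(default_factory=list)    # built hop by hop
    reverse: bool = False                                    # direction bit
    hop_pointer: int = 0                                     # position in the path vector


def switch_process(smp: DirectedRouteSMP, ingress_port: int) -> int:
    """Process the SMP at one switch and return the port to send it out."""
    if not smp.reverse:
        egress_port = smp.forward_path[smp.hop_pointer]
        smp.reverse_path.append(ingress_port)  # remember how to get back
        smp.hop_pointer += 1
        return egress_port
    # Reverse direction: step back and leave through the original input port.
    smp.hop_pointer -= 1
    return smp.reverse_path[smp.hop_pointer]


def target_reply(smp: DirectedRouteSMP) -> None:
    """Target flips the direction bit; the reply leaves on the arrival port."""
    smp.reverse = True
```

For example, a request with forward path [3, 5] that enters the first switch on port 1 and the second on port 2 accumulates the reverse path [1, 2], and the reply retraces ports 2 and then 1.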

3.9.3 General Service Interface

Every IBA channel adapter has a QP dedicated to general fabric services.
This is QP1. QP1 has special features that make it unique compared to
other QPs.

• QP1 is permanently configured for Unreliable Datagram class of
service.

• Each port of an IBT device has a QP1 that sends and receives
packets.

• QP1 is a member of all of the port's partitions (i.e., can accept
any packet specifying a P_Key contained in the port's P_Key table).

• Only management datagrams (MADs) are valid

• Traffic for QP1 does not use VL15

3.9.3.1 Management Datagrams 

IBA defines a standard format for management messages which supports 
common processing. Each MAD contains the same header format that 
identifies the class of management message and the method. SMPs are 
one class of management message, another is directed route SMPs. 
Other classes are called General Management Packets (GMPs) and in- 
clude subnet administration, communication management, performance 
management, SNMP, device management, baseboard management. 

IBA defines common methods that may be adopted by any class. These 
include Get, Set, Trap, Notice, and Action. Of course each management
class defines its own set of attributes. These methods are sufficient for
many classes but IBA provides for class specific methods. 



3.9.3.2 Redirection

QP1 being a well known interface has its advantages and disadvantages.
One disadvantage is that all management classes go into the same queue
which tends to bottleneck and promote head of line blocking. Thus IBA defines
a mechanism that allows the channel adapter to redirect general service
requests to other QPs.

When a channel adapter receives a GMP on QP1, it may respond with a
redirect response indicating a new port and QP. The originator then resends
the request to the new address and also uses that address for all
subsequent requests for that same management class.

3.10 I/O Operation

IBA I/O architecture supports a range of I/O implementation from simple
native devices to massive I/O subsystems. The model for an I/O unit is
shown in Figure 35. An I/O unit is composed of a channel adapter and a
number of I/O controllers. The channel adapter of an I/O unit is referred to
as a Target Channel Adapter (TCA). A TCA has the same functionality as
the HCA, but unlike the HCA, it is not necessarily designed for generic
use, which means that it only needs to support the capabilities required by
its controllers.



Figure 35 I/O Unit (labels: I/O Unit, I/O Controller, I/O Port or Devices)

I/O controllers represent the hardware and software that processes I/O
transaction requests. Examples of I/O controllers are a SCSI interface
controller, a RAID processor, a storage array processor, a LAN port controller,
a disk drive controller, and a console service.

The I/O unit contains a Subnet Management Agent (SMA) that responds
to SMPs received on QP0. The I/O unit also contains general service
agents (GSA*) that respond to GMPs received on the GSI (QP1). The
GSA* contains at least Communication Management and I/O Device
Management (DevMgt). Each I/O controller is registered with the DevMgt
GSA such that it can respond to DevMgt GMPs with specific information
about the controller.

Typically an I/O resource manager in the processor node sends DevMgt 
GMPs to an I/O unit to discover the attributes of the controllers. The at- 
tributes contain sufficient information for the I/O resource manager to 
identify the appropriate I/O driver. The I/O resource manager loads the 
driver, if necessary, and configures the I/O driver with the identity of the 
controller (I/O Unit and Controller ID). The I/O driver then creates the appropriate
communication ports (i.e., QPs) on the processor node and calls
the processor node's communication manager to create the appropriate
connections with the I/O controller. Once the connections are established,
the I/O driver exchanges control messages and data over the connections.



Figure 36 I/O Operation (labels: Upper Level Protocols, I/O Controller)



The number of communication ports used by the driver is an implementa- 
tion variable. An I/O driver may use any available class of service (reliable 
connection, unreliable connection, reliable datagram, or unreliable datagram)
and might use various classes of service for different communication
ports.

Chapter 4: Addressing



This chapter defines IBA addressing terminology and concepts. To facili- 
tate understanding, refer to the following figures. 



Figure 37 Reference IBA Address / Component Association

Figure labels:

• Endnode: One or more GIDs per port. A GID is a valid IPv6 address.

• One or more LIDs per port, up to 2^LMC LIDs.

• One EUI-64 per HCA or TCA.

• Single manufacturer assigned EUI-64 GUID per port. Additional SM
assigned N-1 EUI-64 per port - one per additional GID.

• Switch: Single EUI-64 GUID per switch. One LID per switch (LMC = 0).
Single GID per switch.

• LIDs are unique only within a subnet.

• Routers provide connectivity among subnets.

• Point-to-point links may be used. The subnet manager assigns LID and GID
addresses.

Figure 38 Reference IBA Router Address Association

Figure labels:

• Multi-protocol router contains IBA ports and non-IBA ports.

• IBA Router with Control QPs.

• One EUI-64 GUID per router port.

• One or more LIDs per router port, up to 2^LMC sequential LIDs.

• One or more GIDs per port. A GID is a valid IPv6 address.

4.1 Terminology And Concepts 

Unicast Identifier: An identifier for a single channel adapter or router
port. A packet sent to a unicast identifier is delivered to the port identified
by that identifier. IBA defines two unicast identifiers: a global identifier
(GID), which may be unique across subnets, and a local identifier (LID),
which is unique only within a subnet.

Multicast Identifier: An identifier for a set of destination ports on channel 
adapters or routers. A packet sent to a multicast identifier is delivered to 
all ports identified by that identifier. IBA defines two multicast identifiers - 
a global identifier (GID) used by applications to address a multicast group 
and route packets between subnets and a local identifier (LID) used to 
switch packets within a subnet. 

EUI-64: IEEE defined 64-bit identifier assigned to a device. The EUI-64 is
a 64-bit identifier created by concatenating a 24-bit company_id value and
a 40-bit extension identifier. The company_id is assigned by the IEEE
Registration Authority; the extension identifier is assigned by the organization
with the assigned company_id.

• The universal / local bit in IEEE EUI-64 shall be set to one to indicate
global scope or set to zero to indicate local scope. The manufacturer
assigns an EUI-64 with global scope set. A SM may
assign additional EUI-64 with local scope indicated.

• For additional details, see: "Guidelines For 64-bit Global Identifier
(EUI-64) Registration Authority" at
www.standards.ieee.org/regauth/oui/tutorials/EUI64.html
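
The concatenation that forms an EUI-64 can be sketched as below. The function names and sample values are hypothetical. The universal / local bit is not modeled, because its bit position is not given in this section.

```python
def make_eui64(company_id: int, extension_id: int) -> int:
    """Concatenate a 24-bit company_id with a 40-bit extension identifier."""
    if not 0 <= company_id < (1 << 24):
        raise ValueError("company_id must fit in 24 bits")
    if not 0 <= extension_id < (1 << 40):
        raise ValueError("extension identifier must fit in 40 bits")
    return (company_id << 40) | extension_id


def format_eui64(eui64: int) -> str:
    """Render the 64-bit value as eight colon-separated hexadecimal octets."""
    return ":".join(f"{octet:02x}" for octet in eui64.to_bytes(8, "big"))


# Sample values are made up for illustration.
example = make_eui64(0x0002C9, 0x0300001234)
```

The company_id occupies the most significant 24 bits, so two devices from the same organization differ only in the low 40 bits.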

GUID (Global Unique Identifier): A globally unique EUI-64 compliant 
identifier. 

C4-1: Each HCA, TCA, switch and router shall be assigned an EUI-64 
GUID by the manufacturer.

C4-2: Each port on a CA or router shall be assigned an EUI-64 GUID by
the manufacturer.

Subnet Prefix: A 0 to 64-bit - as a function of scope - identifier used to
uniquely identify a set of links, channel adapter ports, and switches which
are managed by a common subnet manager.

GID Prefix: A 64-bit identifier (upper 64-bits of a GID) created by concatenating
address scope bits, potentially a small number of "filler" bits, and
potentially a subnet prefix - filler and subnet prefix presence is a function
of the address scope.

GID (Global Identifier): A 128-bit unicast or multicast identifier used to
identify a port on a channel adapter, a port on a router, a switch, or a multicast
group. A GID is a valid 128-bit IPv6 address (per RFC 2373) with
additional properties / restrictions defined within IBA to facilitate efficient
discovery, communication, and routing. Note: These rules apply only to
IBA operation and do not apply to raw IPv6 operation unless specifically
called out.

C4-3: GIDs shall comply with the rules defined within 4.1.1 GID Usage
and Properties on page 110:

4.1.1 GID Usage and Properties

1) Each port on a CA or router shall be assigned at least one unicast
GID. The first unicast GID assigned shall be created using the manufacturer
assigned EUI-64 identifier. This GID is referred to as GID
index 0.

2) A unicast GID shall be created using one of the following mechanisms:

a) Concatenation of the default GID prefix (0xFE80::0) and any of
the CA or router port's or the switch's assigned EUI-64 identifier
(at any GID index) - this is referred to as a default GID. A packet
containing a GRH with a GID with this prefix must never be forwarded
by a router, i.e. it is restricted to the local subnet.

b) Concatenation of a subnet manager assigned 64-bit GID prefix
and the CA or router port's or the switch's manufacturer assigned
EUI-64 identifier. A subnet shall have only one assigned GID prefix
at any given time (at GID index 0).

c) Assignment of a GID by the subnet manager. The subnet manager
creates a GID by concatenating the GID prefix with a set of locally
assigned EUI-64 values (at GID index 1 or above).

All CA and router ports and switches must be assigned at least one
unicast GID using either (a) or (b). CA and router ports may be assigned
additional unicast GIDs using (c).

3) Any QP in a CA, switch or router shall be addressable using the default
GID prefix (0xFE80::0) in addition to the assigned GID for that
QP. This allows a subnet to transition from a default GID prefix state
to a managed state without interrupting existing communication sessions.

4) The maximum number (N) of unicast GIDs supported per CA or 
router port is implementation specific. The subnet manager may 
assign N-1 additional unicast GIDs. Each of these N-1 GIDs is 
created by concatenating one subnet manager assigned EUI-64
identifier (the local bit set) with the GID prefix.

5) The unicast GID address 0:0:0:0:0:0:0:0 is reserved - referred to as 
the Reserved GID. It shall never be assigned to any channel adapter, 
switch, or router. It shall not be used as a destination address or in a 
global routing header (GRH). 

6) The unicast GID address 0:0:0:0:0:0:0:1 is referred to as the 
loopback GID and is only used by raw IPv6 services - it is not used by 
IBA transport services. It shall never be assigned to a channel 
adapter or router nor be present in any IBA packets. 

7) The unicast GID subnet prefix shall be limited to the upper 64-bits of 
the GID address space. The number of subnet prefix bits may further 
be limited by filler and scope bits - see below. 

8) The lower 64-bits of the unicast GID cannot be further partitioned into 
subnets. 

9) The lower 64-bits of a unicast GID shall be subnet unique. If the uni- 
versal / local bit is set to universal, then the assignment must be glo- 
bally unique. 

10) The GRH (Global Route Header) shall contain valid source and desti- 
nation GIDs. For raw IPv6 packets, the GRH is treated as an IPv6 
packet header with the source and destination addresses complying 
to RFC 2373. 

11) Unicast GID scoping shall be: 

a) Link-local - A unicast GID used within a local subnet using the de- 
fault GID prefix. Routers must not forward any packets with either 
link-local source or destination GIDs outside the local subnet. A 
link-local GID has the following format: 

10 bits      | 54 bits | 64 bits
1111111010   | 0       | EUI-64 / Assigned Value

Figure 39 Link-Local Unicast GID Format

b) Site-local - A unicast GID used within a collection of subnets
which is unique within that collection (e.g. a data center or
campus) but is not necessarily globally unique. Routers must not
forward any packets with either a site-local Source GID or a
site-local Destination GID outside of the site.



10 bits      | 38 bits | 16-bit Subnet Prefix | 64 bits
1111111011   | 0       | Subnet Prefix        | EUI-64 / Assigned Value

Figure 40 Site-Local Unicast GID Format

c) Global - A unicast GID with a global prefix, i.e. a router may use 
this GID to route packets throughout an enterprise or internet. 
The global GID format is: 



64-bit Subnet Prefix | EUI-64 / Assigned Value

Figure 41 Unicast Global GID Format

12) A multicast GID is an identifier for a group of ports on channel 
adapters and routers. The multicast GID format is: 

8 bits     | 4 bits | 4 bits | remaining 112 bits
11111111   | Flags  | Scope  | Multicast Group Id

Figure 42 Multicast GID Format

a) 8-bits of 11111111 at the start of the GID identifies this as being a 
multicast GID. 

b) Flags is a set of four 1-bit flags: 000T with three flags reserved
and defined as zero ('0'). The T flag is defined as follows:

vi) T = 0 indicates this is a permanently assigned (i.e. well- 
known) multicast GID. See RFC 2373 and RFC 2375 as refer- 
ence for these permanently assigned GIDs. 

vii) T = 1 indicates this is a non-permanently assigned (i.e. tran- 
sient) multicast GID. 

c) Scope is a 4-bit multicast scope value used to limit the scope of
the multicast group. The following table defines scope value and
interpretation.

Table 3 Multicast Address Scope

Scope Value | Address Scope
0x0 | Reserved
0x1 | Unassigned
0x2 | Link-local
0x3 | Unassigned
0x4 | Unassigned
0x5 | Site-local
0x6 | Unassigned
0x7 | Unassigned
0x8 | Organization-local
0x9 | Unassigned
0xA | Unassigned
0xB | Unassigned
0xC | Unassigned
0xD | Unassigned
0xE | Global
0xF | Reserved

13) A CA or router may join zero, one or more multicast groups, i.e. a CA
or router port may be assigned zero, one or more multicast GIDs.

14) Multicast GIDs shall not appear as the source GID in the GRH.

15) Multicast GID FF02:0:0:0:0:0:0:1 is the link-local multicast GID - a
router should not route packets with this destination GID outside the
local subnet. This GID is used as the destination address within the
global route header (GRH) for communicating to a set of QPs
participating within the all channel adapters multicast group. The all
channel adapters multicast group is used to implement a
broadcast service to all channel adapters which are capable of
participating in multicast operations.

16) IPv6 defines a set of reserved multicast addresses in RFC 2375 and
RFC 2373. IBA, unless explicitly stated otherwise, shall not use these
addresses for IBA multicast operations and defines them as reserved
for raw IPv6 usage.

3) A unicast LID shall map to only one port on a CA or router or to
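
The GID rules above can be illustrated with a short sketch. It assumes a GID is written in IPv6 notation and classifies it using only the bit patterns stated in this section; the function name `classify_gid` is hypothetical, and treating every remaining address as "global unicast" is a simplifying assumption rather than a rule from the specification.

```python
import ipaddress


def classify_gid(gid: str) -> str:
    """Classify a 128-bit GID given in IPv6 notation (illustrative only)."""
    value = int(ipaddress.IPv6Address(gid))
    if value == 0:
        return "reserved"            # 0:0:0:0:0:0:0:0, never assigned
    if value == 1:
        return "loopback"            # 0:0:0:0:0:0:0:1, raw IPv6 services only
    if value >> 120 == 0xFF:         # leading 8 bits of 11111111
        flags = (value >> 116) & 0xF
        scope = (value >> 112) & 0xF
        kind = "transient" if flags & 0x1 else "permanent"  # T flag
        return f"multicast ({kind}, scope 0x{scope:X})"
    top10 = value >> 118             # leading 10 bits
    if top10 == 0b1111111010:
        return "link-local unicast"
    if top10 == 0b1111111011:
        return "site-local unicast"
    return "global unicast"          # assumption: everything else


print(classify_gid("FE80::1"))              # link-local unicast
print(classify_gid("FF02:0:0:0:0:0:0:1"))   # multicast (permanent, scope 0x2)
```

The scope nibble printed for a multicast GID can be looked up in Table 3.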

4.1.2 Channel Adapter, Switch, and Router Addressing Rules

C4-4: Channel Adapters, Switches, and Routers shall comply with the
addressing rules defined within 4.1.2 Channel Adapter, Switch, and Router
Addressing Rules on page 114.

Addressing rules are:

2) A CA or router port and switch port 0 shall support a range of LIDs as
defined by a Base LID and an LMC. The LIDs shall be sequentially
ordered starting with a base LID plus (2^LMC - 1) LIDs. The SM may
program the LMC on a port to any value between 0 and 7, to allow
use of multiple LIDs (1-128) in addressing the port.

5) Unicast GIDs shall be assigned to switch port 0 and on a per port
basis for CAs and router.

6) A multiport CA (and by definition, a router) may be attached to one or
more subnets - a port shall only be attached to one subnet at a time.

4.1.3 Local Identifiers

C4-5: Local Identifiers (LIDs) shall comply with the rules defined within
4.1.3 Local Identifiers on page 114.

Local identifier (LID): A 16-bit identifier with the following properties:

1) A LID is assigned by the Subnet Manager (SM) and is subnet unique,
i.e. it cannot be used to route between subnets.

2) The LID address space is divided into reserved, unicast and multicast
address ranges.

3) LIDs are contained within the LRH (Local Route Header).

4) For a CA or router, a source LID (SLID) shall refer to the port that first
injected the packet into the subnet. For a switch initiating a packet,
the SLID shall be the LID associated with that switch.

5) A SLID shall only be associated with a unicast address.

6) A unicast destination LID (DLID) shall refer to the destination port of a
CA or router or to switch port 0. A multicast DLID refers to the set of



InfiniBand™ Trade Association
Page 114
Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067

InfiniBand™ Architecture Release 1.0
Volume 1 - General Specifications
Addressing
October 24, 2000
FINAL



destination ports within the subnet participating in a given multicast
group.

7) If the destination endnode is not on the same subnet, the DLID shall
refer to the router port responsible for forwarding the packet to the
next hop to the destination endnode.

8) From any point within a subnet, a given channel adapter or router
port may receive packets through multiple physical paths within the
subnet. Each physical path may be identified by one or more
destination LIDs. To facilitate multipath operation while minimizing
channel adapter complexity, each CA and router port and switch port
0 shall be assigned a base LID and a LID Mask Control (LMC) value
by the subnet manager. The LMC is a 3-bit field which represents
2^LMC paths (maximum of 128 paths). During discovery, the subnet
manager may determine the number of paths to a given port and will
partition the 16-bit LID space to assign a base LID and up to 2^LMC
sequential LIDs to each port.



[Figure: a subnet connecting Channel Adapter A and Channel Adapter C]


Four paths between channel adapters A and C. CA A is assigned a 
Base LID 4, LMC = 2. This translates to CA A being assigned LIDs: 
{4, 5, 6, 7}. CA C is assigned Base LID 8, LMC = 2. This translates 
into CA C being assigned LIDs: {8, 9, 10, 11}. 

Figure 43 Multipath Identification 
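
The arithmetic in the Figure 43 caption can be reproduced with a minimal sketch. It assumes only what the text states: a port owns 2^LMC sequential LIDs starting at its base LID, with LMC between 0 and 7. The helper name `port_lids` is hypothetical.

```python
from typing import List


def port_lids(base_lid: int, lmc: int) -> List[int]:
    """Return the sequential LIDs addressed to a port with this base LID and LMC."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be between 0 and 7")
    count = 1 << lmc                      # 2^LMC LIDs, at most 128
    return list(range(base_lid, base_lid + count))


print(port_lids(4, 2))   # [4, 5, 6, 7]    (CA A in Figure 43)
print(port_lids(8, 2))   # [8, 9, 10, 11]  (CA C in Figure 43)
```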

9) The LID space is defined as follows:

• LID 0x0000 is reserved.

• LID 0xFFFF is defined as a permissive DLID. The permissive
DLID indicates that the packet is destined for QP0 on the channel
adapter or router port or switch which received it. LMC is not
defined for this address.

• The unicast LID range is a flat identifier space defined as 0x0001
to 0xBFFF.

• The multicast LID range is a flat identifier space defined as
0xC000 to 0xFFFE.
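
As an illustration of the LID space partition just listed, the following sketch maps a 16-bit LID to its range. The function name `classify_lid` and the returned labels are hypothetical; the boundaries are the ones given in the list above.

```python
def classify_lid(lid: int) -> str:
    """Map a 16-bit LID to the address range it falls in."""
    if not 0 <= lid <= 0xFFFF:
        raise ValueError("a LID is a 16-bit value")
    if lid == 0x0000:
        return "reserved"
    if lid == 0xFFFF:
        return "permissive"      # permissive DLID, destined for QP0
    if lid <= 0xBFFF:
        return "unicast"         # 0x0001 to 0xBFFF
    return "multicast"           # 0xC000 to 0xFFFE


print(classify_lid(0x0004))  # unicast
print(classify_lid(0xC000))  # multicast
```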

Chapter 5: Data Packet Format



5.1 Packet Types 



This chapter introduces the fields in the data packet. A brief description of 
each field is given including a definition, field size, and abbreviation. This 
chapter does not specify the details of each field, but only the general 
usage and layout of the fields. 

In addition to data packets, IBA defines link packets which are used for 
link-level flow control. The format of these link packets is described in 
7.9.4 Flow Control Packet on page 176 . 

In this specification, the term packet refers to data packets only (i.e.
packet and data packet are synonymous). Where reference to link
packets is intended, the full term link packet will be used.



Packets are the unit of transfer in IBA. As described in 3.3 Communica- 
tions Stack on page 62 messages are segmented into packets by the CAs 
for transmission across the IB fabric. 

Packets have the following attributes: 

• Indivisible unit of data transfer and routing

• Unit of acknowledgement

• Unit of segmentation and re-assembly for messages

• Unit of link-level flow control



[Figure: an IBA Message (End to End) carries the Message Data. It is
divided into IBA Data Packets (Routed unit of work), each made up of a
Routing Header, a Transport Header, the Packet Payload and a CRC.]

Figure 44 IBA Messages and Packets

There are two general classes of transports used in Packets:

• IBA Packets have IBA defined transport headers, are routed on IBA
fabrics, and use native IBA transport facilities.

• Raw Packets may be routed on IBA fabrics but do not contain IBA
transport headers. From the IB point of view, these packets contain
only IBA routing headers, payload and CRC. IBA does not define the
processing of these packets above the link and network layers. The
intent is that these packets can be used to support non-IBA
transports over an IB fabric.

5.2 Data Packet Format



The overall data packet structure is shown in Figure 45 on page 119 . 
There are two routing headers that precede a transport header(s) and 
payload: 

• The local route header is required on all packets.

• The global route header is required on all packets that are to be
routed to a different subnet, and on all multicast packets regardless of
destination.

• A global route header may be placed on any packet except subnet
management packets.

C5-1: Packets generated by an InfiniBand device shall conform to the
packet structure defined in Figure 45 and to the packet header location
and size requirements as defined in Figure 46.

Each IBA packet ends with an invariant CRC followed by a variant CRC. 
Each raw packet ends with a variant CRC. 

Figure 45 IBA Packet Overview

The IBA packet structure is shown in Figure 46 on page 120.

Local Routing Header - LRH - 8 Bytes
Present in all packets of a message.

Global Routing Header - GRH - 40 Bytes
Present in all packets of message, if indicated by Link Next
Header field in LRH.

Base Transport Header - BTH - 12 Bytes
Present in all packets of message, if indicated by Link Next
Header field (i.e. not a raw packet).

Reliable Datagram Extended Transport Header - RDETH - 4 Bytes
Present in every packet of reliable datagram message.

Datagram Extended Transport Header - DETH - 8 Bytes
Present in every packet of datagram request messages.

RDMA Extended Transport Header - RETH - 16 Bytes
Present in first packet of RDMA request message.

Atomic Extended Transport Header - AtomicETH - 28 Bytes
Present in Atomic request message.

ACK Extended Transport Header - AETH - 4 Bytes
Present in all ACK packets, including first and last packet of
message for RDMA Read Response packets.

Atomic ACK Extended Transport Header - AtomicAckETH - 8 Bytes
Present in all AtomicACK packets.

Immediate Data - ImmDt - 4 Bytes
Present in last packet of request with immediate data.

Payload - PYLD - 0-4096 Bytes

Invariant CRC - ICRC - 32b
Present in all packets of message, if indicated by Link Next
Header field (i.e. not a raw packet).

Variant CRC - VCRC - 16b
Present in all packets of message.

Figure 46 IBA Packet Structure

5.2.1 Local Route Header (LRH) - 8 Bytes

C5-2: Packets generated by an InfiniBand device shall conform to the
packet header format for the LRH as defined in Table 4.

The Local Routing Header (LRH) contains fields used for local routing by
switches within an IBA subnet. The following table summarizes the fields in
the LRH:

Table 4 Local Route Header Fields

Field Name | Field Abbreviation | Field Size (in bits) | Description
Virtual Lane | VL | 4 | This field identifies the virtual lane that the packet is using.
Link Version | LVer | 4 | This field identifies the Link level protocol of this packet. This version applies to the general packet structure including the LRH fields and the variant CRC.
Service Level | SL | 4 | This field indicates what service level the packet is requesting within the subnet.
Reserved | - | 2 | Transmitted as 0, ignored on receive.
Link Next Header | LNH | 2 | This field identifies the headers that follow the LRH.
Destination Local ID | DLID | 16 | This field identifies the destination port and path (data sink) on the local subnet.
Reserved | - | 5 | Transmitted as 0, ignored on receive.
Packet Length | PktLen | 11 | This field identifies the size of the Packet in four-byte words. This field includes the first byte of LRH to the last byte before the variant CRC. See 7.7.8 Packet Length (PktLen) - 11 bits on page 160 for details on max and min values of PktLen.
Source Local ID | SLID | 16 | This field identifies the source port (injection point) on the local subnet.


The LRH fields are fully defined in 7.7 Local Route Header on page 158 . 
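
A minimal sketch of how the Table 4 fields fill the 8-byte LRH follows. It assumes the fields are packed in the order the table lists them, most significant bit first and in network byte order; that ordering, and the names `pack_lrh` and `unpack_lrh`, are assumptions made for illustration, since the authoritative layout is in Section 7.7.

```python
import struct


def pack_lrh(vl: int, lver: int, sl: int, lnh: int,
             dlid: int, pkt_len: int, slid: int) -> bytes:
    """Pack the Table 4 fields into 8 bytes; reserved bits are sent as 0."""
    # VL(4) | LVer(4) | SL(4) | Reserved(2) | LNH(2)
    word0 = ((vl & 0xF) << 12) | ((lver & 0xF) << 8) | ((sl & 0xF) << 4) | (lnh & 0x3)
    # Reserved(5) | PktLen(11)
    word2 = pkt_len & 0x7FF
    return struct.pack(">HHHH", word0, dlid & 0xFFFF, word2, slid & 0xFFFF)


def unpack_lrh(data: bytes) -> dict:
    """Recover the Table 4 fields from the first 8 bytes of a packet."""
    word0, dlid, word2, slid = struct.unpack(">HHHH", data[:8])
    return {
        "vl": word0 >> 12,
        "lver": (word0 >> 8) & 0xF,
        "sl": (word0 >> 4) & 0xF,
        "lnh": word0 & 0x3,
        "dlid": dlid,
        "pkt_len": word2 & 0x7FF,   # in four-byte words
        "slid": slid,
    }


header = pack_lrh(vl=0, lver=0, sl=1, lnh=2, dlid=8, pkt_len=20, slid=4)
print(header.hex())          # 0012000800140004
print(unpack_lrh(header))
```

The field widths sum to 64 bits, which matches the 8-byte size given in the section title.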

5.2.2 Global Route Header (GRH) - 40 Bytes

C5-3: Packets generated by InfiniBand devices shall conform to the
packet header format for the GRH as defined in Table 5.

Global Route Header (GRH) contains fields for routing the packet
between subnets. The presence of the GRH is indicated by the Link Next
Header (LNH) field in the LRH. The layout of the GRH is the same as the
IPv6 Header defined in RFC 2460. The following table summarizes the
fields in the GRH.

Table 5 Global Route Header Fields

Field Name | Field Abbreviation | Field Size (in bits) | Description
IP Version | IPVer | 4 | This field indicates version of the GRH.
Traffic Class | TClass | 8 | This field is used by IBA to communicate global service level.
Flow Label | FlowLabel | 20 | This field identifies sequences of packets requiring special handling.
Payload length | PayLen | 16 | This field indicates the length of the packet in bytes following the GRH. This includes from the first byte after the end of the GRH up to but not including either the VCRC or any padding to achieve 4 byte length. For raw packets with GRH, the padding is determined from the lower two bits of this GRH:Payload length field. (For IBA packets it is determined from the pad field in the transport header.) Padding is placed immediately before the VCRC field. Note: GRH:PayLen is different from LRH:PktLen.
Next Header | NxtHdr | 8 | This field identifies the header following the GRH. This field is included for compatibility with IPv6 headers. It should indicate IBA transport.
Hop Limit | HopLmt | 8 | This field sets a strict bound on the number of hops between subnets a packet can make before being discarded. This is enforced only by routers.
Source GID | SGID | 128 | This field identifies the Global Identifier (GID) for the port which injected the packet into the network.
Destination GID | DGID | 128 | This field identifies the GID for the port which will consume the packet from the network.
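
The following sketch assembles a 40-byte GRH from the Table 5 fields. It assumes the fields are laid out in table order, as in the IPv6 header of RFC 2460, and it defaults `ip_ver` to 6 on that basis; the default, the function name `pack_grh`, and the sample NxtHdr value in the usage line are assumptions for illustration, not values taken from this section.

```python
import ipaddress
import struct


def pack_grh(tclass: int, flow_label: int, pay_len: int, nxt_hdr: int,
             hop_lmt: int, sgid: str, dgid: str, ip_ver: int = 6) -> bytes:
    """Pack the Table 5 fields into a 40-byte header (illustrative only)."""
    # IPVer(4) | TClass(8) | FlowLabel(20)
    word0 = ((ip_ver & 0xF) << 28) | ((tclass & 0xFF) << 20) | (flow_label & 0xFFFFF)
    # PayLen(16) | NxtHdr(8) | HopLmt(8)
    head = struct.pack(">IHBB", word0, pay_len & 0xFFFF, nxt_hdr & 0xFF, hop_lmt & 0xFF)
    # SGID(128) | DGID(128), given here in IPv6 notation
    return head + ipaddress.IPv6Address(sgid).packed + ipaddress.IPv6Address(dgid).packed


grh = pack_grh(tclass=0, flow_label=0, pay_len=32, nxt_hdr=0, hop_lmt=1,
               sgid="FE80::1", dgid="FE80::2")
print(len(grh))   # 40
```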



5.2.3 Base Transport Header (BTH) - 12 Bytes

C5-4: Packets generated by an InfiniBand device shall conform to the
packet header format for the BTH as defined in Table 6.

Base Transport Header (BTH) contains the fields for IBA transports. The
presence of BTH is indicated by the Next Header field of the last previous
header (i.e. either LRH:LNH or GRH:NxtHdr depending on which was the
last previous header). The following table summarizes the fields in the
BTH:

Table 6 Base Transport Header Fields

Field Name | Field Abbreviation | Field Size (in bits) | Description
Opcode | OpCode | 8 | This field indicates the IBA packet type. The OpCode also specifies which extension headers follow the Base Transport Header.
Solicited Event | SE | 1 | This bit indicates that an event should be generated by the responder.
MigReq | M | 1 | This bit is used to communicate migration state.
Pad Count | PadCnt | 2 | This field indicates how many extra bytes are added to the payload to align to a 4 byte boundary.
Transport Header Version | TVer | 4 | This field indicates the version of the IBA Transport Headers.
Partition Key | P_KEY | 16 | This field indicates which logical Partition is associated with this packet (see 10.9 Partitioning on page 427).
Reserved (variant) | - | 8 | Transmitted as 0, ignored on receive. This field is not included in the invariant CRC. See 7.8 CRCs on page 161 for details.
Destination QP | DestQP | 24 | This field indicates the Work Queue Pair Number (a.k.a. QP) at the destination.
Acknowledge Request | A | 1 | This bit is used to indicate that an acknowledge (for this packet) should be scheduled by the responder.
Reserved | - | 7 | Transmitted as 0, ignored on receive. This field is included in the invariant CRC.
Packet Sequence Number | PSN | 24 | This field is used to detect a missing or duplicate Packet. See 9.7.1 Packet Sequence Numbers (PSN) on page 240 for a detailed description of PSN.



The Base Transport Header fields are defined in detail in
Section 9.2 on page 199.
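
As a rough illustration of Table 6, the sketch below decodes a 12-byte BTH. It assumes the fields appear in table order, most significant bit first and in network byte order; that ordering and the name `unpack_bth` are assumptions for illustration, since the authoritative definition is in Section 9.2.

```python
import struct


def unpack_bth(data: bytes) -> dict:
    """Decode the Table 6 fields from the first 12 bytes of a BTH."""
    if len(data) < 12:
        raise ValueError("a BTH is 12 bytes long")
    # OpCode(8) | SE,M,PadCnt,TVer(8) | P_KEY(16) | Rsvd(8)+DestQP(24) | A,Rsvd(8)+PSN(24)
    opcode, flags, p_key, word1, word2 = struct.unpack(">BBHII", data[:12])
    return {
        "opcode": opcode,
        "se": flags >> 7,
        "m": (flags >> 6) & 0x1,
        "pad_cnt": (flags >> 4) & 0x3,
        "tver": flags & 0xF,
        "p_key": p_key,
        "dest_qp": word1 & 0xFFFFFF,   # top 8 bits are reserved (variant)
        "ack_req": word2 >> 31,
        "psn": word2 & 0xFFFFFF,       # 7 reserved bits sit between A and PSN
    }


sample = bytes([0x04, 0xA0, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x2A, 0x80, 0x00, 0x00, 0x07])
print(unpack_bth(sample))
```

The field widths in Table 6 sum to 96 bits, which matches the 12-byte size in the section title.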

5.2.4 Reliable Datagram Extended Transport Header (RDETH) - 4 Bytes

o5-1: Packets generated by an InfiniBand device that supports reliable
datagrams shall conform to the packet header format for the RDETH
as defined in Table 7.

Reliable Datagram Extended Transport Header (RDETH) contains the
additional transport fields for reliable datagram service. The RDETH is only
in Reliable Datagram packets as indicated by the Base Transport Header
Opcode field. The following table summarizes the fields in the RDETH:

Table 7 Reliable Datagram Extended Transport Header Fields

Field Name | Field Abbreviation | Field Size (in bits) | Description
Reserved | - | 8 | Transmitted as 0, ignored on receive.
EE-Context | EECnxt | 24 | This field indicates which End-to-End Context should be used for this Reliable Datagram packet.

The detailed definition of the Reliable Datagram Extended Transport
Header is in Section 9.3.1 Reliable Datagram Extended Transport Header
(RDETH) - 4 Bytes on page 203.

5.2.5 Datagram Extended Transport Header (DETH) - 8 Bytes 

C5-5: Packets generated by an InfiniBand device shall conform to the
packet header format for the DETH as defined in Table 8.

Datagram Extended Transport Header (DETH) contains the additional
transport fields for datagram service. The DETH is only in datagram
packets if indicated by the Base Transport Header Opcode field. The
following table summarizes the fields in the DETH:

Table 8 Datagram Extended Transport Header Fields

Field Name | Field Abbreviation | Field Size (in bits) | Description
Queue Key | Q_Key | 32 | This field is required to authorize access to the receive queue.
Reserved | - | 8 | Transmitted as 0, ignored on receive.
Source QP | SrcQP | 24 | This field indicates the Work Queue Pair Number (a.k.a. QP) at the source.

The detailed definition of the Datagram Extended Transport Header is in
Section 9.3.2 Datagram Extended Transport Header (DETH) - 8 Bytes on
page 204.

5.2.6 RDMA Extended Transport Header (RETH) - 16 Bytes

o5-2: Packets generated by an InfiniBand device that supports RDMA
operations shall conform to the packet header format for the RETH as
defined in Table 9.

RDMA Extended Transport Header (RETH) contains the additional
transport fields for RDMA operations. The RETH is present in only the first (or
only) packet of an RDMA Request as indicated by the Base Transport
Header Opcode field. The following table summarizes the fields in the
RETH:

Table 9 RDMA Extended Transport Header Fields

Field Name | Field Abbreviation | Field Size (in bits) | Description
Virtual Address | VA | 64 | This field is the Virtual Address of the RDMA operation.
Remote Key | R_Key | 32 | This field is the Remote Key that authorizes access for the RDMA operation.
DMA Length | DMALen | 32 | This field indicates the length (in Bytes) of the DMA operation.

The detailed definition of the RDMA Extended Transport Header is in
9.3.3 RDMA Extended Transport Header (RETH) - 16 Bytes on page 205.

5.2.7 Atomic Extended Transport Header (AtomicETH) - 28 Bytes 

o5-3: Packets generated by an InfiniBand device that supports atomic
operations shall conform to the packet header format for the AtomicETH
header as defined in Table 10.

Atomic Extended Transport Header (AtomicETH) contains the additional
transport fields for Atomic packets. The AtomicETH is only in Atomic
packets as indicated by the Base Transport Header Opcode field. The
following table summarizes the fields in the AtomicETH:

Table 10 Atomic Extended Transport Header Fields

Field Name | Field Abbreviation | Field Size (in bits) | Description
Virtual Address | VA | 64 | This field is the remote virtual address.
Remote Key | R_Key | 32 | This field is the Remote Key that authorizes access to the remote virtual address.
Swap (or Add) Data | SwapDt | 64 | This field is an operand in atomic operations.
Compare Data | CmpDt | 64 | This field is an operand in CmpSwap atomic operation.

The detailed definition of the Atomic Extended Transport Header is in
Section 9.3.5 ACK Extended Transport Header (AETH) - 4 Bytes on page
207.

5.2.8 ACK Extended Transport Header (AETH) - 4 Bytes 

C5-6: Packets generated by an InfiniBand device shall conform to the
packet header format for the AETH as defined in Table 11.

ACK Extended Transport Header (AETH) contains the additional trans- 
port fields for ACK packets. The AETH is only in Acknowledge packets as 
indicated by the Base Transport Header Opcode field. The following table 
summarizes the fields in the AETH. 

Table 11 ACK Extended Transport Header Fields

Field Name | Field Abbreviation | Field Size (in bits) | Description
Syndrome | Syndrome | 8 | This field indicates if this is an ACK or NAK packet plus additional information about the ACK or NAK.
Message Sequence Number | MSN | 24 | This field indicates the sequence number of the last message completed at the responder.

The detailed definition of the ACK Extended Transport Header is in
Section 9.3.5 on page 207.

5.2.9 Atomic ACK Extended Transport Header (AtomicAckETH) - 8 Bytes 

o5-4: Packets generated by an InfiniBand device that supports atomic
operations shall conform to the packet header format for the AtomicAckETH
as defined in Table 12.

Atomic ACK Extended Transport Header (AtomicAckETH) contains the
additional transport fields for AtomicACK packets. The AtomicAckETH is
only in Atomic Acknowledge packets as indicated by the Base Transport
Header Opcode field. The following table summarizes the fields in the
AtomicAckETH:

Table 12 Atomic ACK Extended Transport Header Fields

Field Name | Field Abbreviation | Field Size (in bits) | Description
Original Remote Data | OrigRemDt | 64 | This field is the return operand in atomic operations and contains the data in the remote memory location before the atomic operation.

The detailed definition of the Atomic ACK Extended Transport Header is
in Section 9.3.5.3 on page 207.

5.2.10 Immediate Data Extended Transport Header (ImmDt) - 4 Bytes 

Immediate Data Extended Transport Header (ImmDt) contains the
additional data that is placed in the receive Completion Queue Element
(CQE). The ImmDt is only in Send or RDMA-Write packets with Immediate
Data if indicated by the Base Transport Header Opcode.

The detailed definition of the Immediate Data Extended Transport Header
is in Section 9.3.6 on page 208.

Note: the terms Immediate Data Extended Transport Header and
Immediate Data are used synonymously in this specification.



5.2.11 Payload 



Payload (PYLD) contains the application data being transferred end to 
end. Payload is not present in RDMA Read Requests, Acknowledge, 
CmpSwp, FetchAdd, and Atomic Acknowledge packets. It is optionally 
present in the other packet op-codes. 

C5-7: The length of the Payload shall be 0 or more bytes up to the full path 
MTU. 

C5-8: All packets of an IBA message that contain a payload shall fill the
payload to the full path MTU except the last (or only) packet of the
message.

C5-9: In a packet using InfiniBand transport, a Pad field of 0-3 bytes shall
be included in the packet and used to align the Payload to a multiple of 4
bytes (i.e. the size of the Payload plus the Pad field is always a multiple
of four bytes). The actual size of the Pad field used in a given packet shall
be indicated in the Base Transport Header PadCnt field of the packet.
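
The payload rules above can be sketched in a few lines. The sketch assumes a message is split into full path MTU payloads with a shorter last payload, and that the pad count is the number of bytes needed to reach the next multiple of four; the names `pad_count` and `segment` are hypothetical.

```python
from typing import List, Tuple


def pad_count(payload_len: int) -> int:
    """Bytes of padding (0-3) needed so payload plus pad is a multiple of 4."""
    return -payload_len % 4


def segment(message: bytes, path_mtu: int) -> List[Tuple[bytes, int]]:
    """Split a message into (payload, PadCnt) pairs, one per packet."""
    if path_mtu <= 0:
        raise ValueError("path MTU must be positive")
    # Every payload but the last is filled to the full path MTU;
    # an empty message still yields one packet with a 0-byte payload.
    chunks = [message[i:i + path_mtu] for i in range(0, len(message), path_mtu)] or [b""]
    return [(chunk, pad_count(len(chunk))) for chunk in chunks]


for payload, pad in segment(b"x" * 10, 4):
    print(len(payload), pad)   # 4 0, then 4 0, then 2 2
```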

5.2.12 Invariant CRC

Invariant CRC (ICRC) covers the fields that do not change in a message
from source to destination. ICRC is only in IBA packets, and is not present
in Raw Packets. Which fields are covered in the ICRC is dependent on the
presence of the GRH.

The detailed definition of the Invariant CRC is in Section 7.8.1 on page
161.

5.2.13 Variant CRC

Variant CRC (VCRC) covers the fields that can change from link to link.
The VCRC is in all packets, both IBA and Raw Packets. The VCRC can
be regenerated in the fabric.

The detailed definition of the Variant CRC is in Section 7.8.2 on page 163.

5.3 Raw Packet Format

A Raw Packet is a packet that does not use IBA transport. Raw packets
are not a required feature of InfiniBand devices, but if they are supported,
the raw packet shall be formatted as specified in this section.

o5-5: If a Raw packet contains a Global Routing Header, the packet struc- 
ture shall be: LRH, GRH, Payload (including any transport headers), and 
VCRC. If a Raw packet does not contain a GRH, then the structure shall 
be: LRH, RWH, Payload, and VCRC. 

o5-6: The RWH is a 32 bit "Raw Header" that shall contain the EtherType 
of the payload. EtherType indicates the protocol of the raw packet and 
shall conform to the definition in the IEEE Type Field Registrar. (See stan- 
dards IEEE 802.3, 1998 Clause 3.2.6 Length/Type Field specifications 
and IEEE 802.1H-1995 for use of the Type Field.) 

This format of "Raw" packets is shown in Figure 45 on page 119 . 

o5-7: The length of a raw packet (from after the RWH to before the variant 
CRC) must be a multiple of 4 bytes. 

o5-8: The format of the Raw Header shall be as shown in Figure 47.

Figure 47 Raw Header (RWH)

5.4 Packet Examples

Some examples of IBA packets are shown in Figure 48.

Figure 48 IBA Packet Examples

Chapter 6: Physical Layer Interface



6.1 Overview 



This chapter describes services provided by the physical layer to the link 
layer and the logical interface between these layers. The physical layer 
also has an interface to management which is not covered in this chapter. 

The description of the physical layer is provided in Volume 2, the
electromechanical specification.



[Figure: Link Layer - Link-technology-independent Logic, above
Physical Layer - Link-technology-dependent Functions: link width
support, data encoding, voltage, packet framing]

Figure 49 Physical Functions and Physical/Link Interface



6.2 Services provided by the Physical Layer. 

The physical layer is responsible for: 



• establishing a physical link when possible,

• informing the link layer whether the physical link is up or down,

• monitoring the status of the physical link, and

• when the physical link is up:

  • delivering received control and data bytes to the link layer, and

  • transmitting control and data bytes from the link layer.

See Volume 2 for physical layer specifications.

6.3 Interface between physical and Link Layers.

This chapter does not intend to describe an actual interface within a chip
- it describes the functionality of the interface between the
link-technology-dependent physical send and receive functions, and the
link-technology-independent link logical function.

This interface is designed to keep the link and higher layer interface
independent of physical layer implementation. The physical layer deals with all
details that are dependent on the characteristics of operation over a
particular physical layer such as line code.

The purpose of describing a logical interface and the related state
machines is to partition functions to describe external behavior of IBA
devices as simply and clearly as possible. Such descriptions are not
intended to imply details of the internal implementation of devices. For
instance, the interface described here does not imply the width of the
internal link path which will be implementation dependent.

16 

6.3.1 Interface between physical receive and link receive.

The following messages are sent between the physical receive function
and the link logic.

6.3.1.1 Phy_link - Physical Link Status

This message conveys the status of the physical link from the physical
receive function to the link logic. This message is sent when physical link
status changes and can take the following values:

up      the physical link is trained and operational

down    the physical link is not operational. Sent when the link is in
        any non-operational status including no receive signal or
        retraining in progress

These values report the status of the physical link as needed by the link
logic. Any finer grain information needed by management (e.g. no_signal
or retraining) will be obtained by management from the physical layer
rather than passed through the link layer.

34 

6.3.1.2 L_Init_Train - Link Initiate Retraining

This message is a request for retraining of the physical link. It is sent
from the link logic to the physical receive function when the link logic
has detected a need to retrain the link. See Section 7.12.2, "Error
Recovery Procedures," on page 187 for usage of this message.

6.3.1.3 Rcv_stream - Receive Stream

This message conveys the control and data stream decoded by the receiver
from the physical receive function to the link logic. This message is
sent once for each data byte and once for each control signal received.
The idle signaling of the physical link is treated as one control signal.
This message can take the following values:

6.3.2 Interface between physical transmit and link transmit

The following messages are sent between the physical transmit function
and the link logic.

6.3.2.1 Xmit_stream - Transmit Stream

This message conveys the control and data stream from the link logic to
the physical layer. This message is sent once for each data byte and once
for each control signal to be sent. The idle message causes the physical
send function to send idles until a new message is received. This
message can take the following values:

6.3.2.2 Xmit_Ready - Physical Transmitter Ready

This message is sent from the physical transmit function to the link
transmitter to indicate whether the physical transmit function is ready
to start transmitting a new packet. This provides physical layer
dependent pacing back to the link layer since many physical layers have
constraints that prevent sending continuous packet traffic. This message
can take the following values:

rdy     ready for packet initiation

wait    hold off packet initiation
Chapter 7: Link Layer

7.1 Overview

This chapter describes the behavior of the link and specifies the link
level operations for devices attached to an IBA network. The link layer
handles the sending and receiving of data across the links at the packet
level. Services provided by the link layer include addressing, buffering,
flow control, error detection and switching.

State machines are used in this specification to define the logical
operation of the link layer as externally visible. They are not intended
to define internal details of implementation. For instance, the packet
receiver state machine operates on data received from the link layer as a
stream of bytes though it is expected that many implementations of the
link layer will process multiple bytes of the data stream in parallel.

7.1.1 State Machine Conventions

State machines are described to provide a clear description of the
external behavior of the devices. Their description is not intended to
imply the internal implementation of IBA devices. Actual implementations
must take into account other considerations such as efficiency and
suitability to the implementation technology.

The state machines in this chapter use the following conventions:

• Each state is represented by a box.

  • The top section of the box contains the state name.

  • The bottom section of the box contains the actions which occur in
    the state.

• Transition arrows indicate state transitions which will be made
  when the expression next to the arrow is satisfied.

• A transition arrow which does not originate in a state indicates a
  global transition. Such a transition will occur regardless of the
  current state. For instance, in Figure 50 on page 136, there is a
  global transition into the LinkDown state.

• If no exit condition for a state is satisfied, the machine remains in
  the current state.

• "Or" is represented by "+".

• "And" is represented by
• The state diagrams represent the primary specification for the
  functions they depict. When a conflict exists between a state
  diagram and descriptive text, the state diagram takes precedence.

7.2 Link States

C7-1: A port shall control its state and overall operation as specified in
Figure 50 Link State Machine on page 136 and Section 7.2.7, "State
Machine Terms," on page 137.

The states LinkInitialize and LinkArm are used by subnet management to
configure devices on the subnet. Refer to 14.3.5 Port State Change on
page 652 for additional information on how these states are used.

The link state machine is depicted in Figure 50. The following is a
description of the states of this state machine.

7.2.1 LinkDown State

In the LinkDown state, the physical link is not up (that is, the physical
layer is sending phy_link=down to the link layer) and the link layer is
idle. In this state the link layer discards all packets presented to it
for transmission.

7.2.2 LinkInitialize State

In the LinkInitialize state, the physical link is up (that is, the
physical layer is sending phy_link=up to the link layer) and the link
layer can only receive and transmit subnet management packets (SMPs) and
flow control link packets. While in this state, the link layer discards
all other packets received or presented to it for transmission.

7.2.3 LinkArm State

In the LinkArm state, the physical link is up and the link layer can
receive and transmit SMPs and flow control link packets. Additionally, the
link layer can receive all other packets but discards all non-SMP data
packets presented to it for transmission.

7.2.4 LinkActive State

In the LinkActive state, the physical link is up and the link layer can
transmit and receive all packet types.

7.2.5 LinkActDefer State

The LinkActDefer state is entered from the LinkActive state when the
physical layer indicates a failure in the link. If the error persists, the
LinkDownTimeout expires and the port state transitions to LinkDown state.
If the physical layer recovers prior to LinkDownTimeout expiration, the
port state machine returns to the LinkActive state. While in the
LinkActDefer state, the link layer will not transmit or receive packets.
It may process
Figure 50 Link State Machine

The purpose of this state is to allow for retraining of the physical link
without requiring reinitialization of the link and higher layers.

Management can send commands to attempt to alter the link state by
sending a set request to the link port state in PortInfo. Only values of
Down, Arm and Active are valid for such set requests. Commands to
change state to Arm or Active are only valid when they appear as an exit
term for the current state.

C7-2: Any management state change command with a value other than
Down, Arm, or Active shall not result in a state change.

C7-3: A management state change command which is not valid in the
current state shall not result in a state change.

For instance, Active is only valid when the current state is LinkArm. If
the command is not valid for the current state, it will not cause a state
change.
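
Requirements C7-2 and C7-3 amount to a two-step filter on management
commands. The following is a minimal sketch in Python, not part of the
specification. The function name and the ASSUMED_EXIT_TERMS table are
hypothetical: only the rule that Active is accepted in LinkArm comes from
the text above, and the remaining entries are assumptions standing in for
the exit terms of Figure 50, which remains authoritative.

```python
# Command values that a management set request may carry (C7-2).
VALID_COMMANDS = {"Down", "Arm", "Active"}

# Hypothetical stand-in for the exit terms drawn in Figure 50.
# Only "Active from LinkArm" is stated in the text; the rest is assumed.
ASSUMED_EXIT_TERMS = {
    "LinkInitialize": {"Down", "Arm"},
    "LinkArm": {"Down", "Active"},
    "LinkActive": {"Down"},
}


def command_is_eligible(state: str, command: str) -> bool:
    """Return True if a management command may act as an exit term."""
    if command not in VALID_COMMANDS:
        return False  # C7-2: any other value never changes the state
    # C7-3: a command that is not an exit term of the current state is ignored
    return command in ASSUMED_EXIT_TERMS.get(state, set())
```

Under these assumptions, command_is_eligible("LinkInitialize", "Active")
is False, so the port stays in LinkInitialize.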

7.2.7 State Machine Terms

Reset - An internal signal to reset the interface.

Remote_init - a link packet with the flow control initialize Op code (see
7.9.4 Flow Control Packet on page 176) has been received and has
passed the checks of the link packet check state machine.

Active_enable - a flag to prevent a premature transition from armed to
active. It is set to false when the LinkInitialize state is exited. It is
set to true when a link packet with the normal flow control Op code has
been received and has passed the checks of the link packet check state
machine while in the LinkArm state.

PhyLink - the physical link status, phy_link, from the physical layer
(refer to 6.3.1.1 Phy_link - Physical Link Status on page 131). Valid
values are Up and Down.

PortState - the value of the PortState component of the PortInfo
attribute. (Refer to 14.2.5.6 PortInfo on page 633.) Valid values are
"Down", "Initialize", "Arm", "Active", and "ActiveD".

CPortState - a value that indicates commands from management to
change the port state. Valid values are "Down", "Arm", and "Active". Note
that when phy_link=up and CPortState=down, the state machine will
transition to the LinkDown state which will reset other link state
machines. Since phy_link=up, this will be followed by a transition to the
LinkInitialize state. Thus a command to change link port state to down
provides a way to re-initialize the link layer. To disable a port requires
a command to the physical layer port state machine. The value of
CPortState shall only persist while in the state where it was received. If
it satisfies a transition term from that state, it shall cause the
transition. If it does not, it shall cause no transitions. Any state
transition clears CPortState.

DataPktXmitEnable - a Boolean that indicates the link layer's action with
respect to transmission of non-SMP data packets. When True, transmission
of non-SMP data packets is enabled. When False, non-SMP data packets
submitted to link layer for transmission are discarded.

The packet receiver's primary input is the rcv_stream (refer to 6.3.1.3
Rcv_stream - Receive Stream on page 132).

DataPktRcvEnable - a Boolean that indicates the link layer's action with
respect to reception of non-SMP data packets from the physical layer.
When True, reception of non-SMP data packets is enabled. When False,
non-SMP data packets received from the physical layer are discarded.

SMPEnable - a Boolean that indicates the link layer's action with respect
to transmission and reception of subnet management packets (SMPs).
When True, transmission and reception of SMPs are enabled. When
False, SMPs submitted to link layer for transmission or reception are
discarded.

LinkPktEnable - a Boolean that indicates the link layer's action with
respect to transmission and reception of link packets. When True,
transmission and reception of link packets are enabled. When False, link
packets are not generated by the link layer and any link packets received
are discarded.
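
The four enable flags can be read together with the state descriptions in
7.2.1 through 7.2.4. The sketch below is an illustration in Python, not
part of the specification: the LinkEnables type, the table, and the helper
are hypothetical names, and the flag values are one reading of the prose
(Figure 50 is authoritative). The LinkDown row assumes that nothing is
received while the physical link is down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkEnables:
    data_pkt_xmit: bool  # DataPktXmitEnable
    data_pkt_rcv: bool   # DataPktRcvEnable
    smp: bool            # SMPEnable
    link_pkt: bool       # LinkPktEnable


# Flag values inferred from the prose of 7.2.1 to 7.2.4 (an assumption).
ENABLES_BY_STATE = {
    "LinkDown": LinkEnables(False, False, False, False),
    "LinkInitialize": LinkEnables(False, False, True, True),
    "LinkArm": LinkEnables(False, True, True, True),
    "LinkActive": LinkEnables(True, True, True, True),
}


def accepts_for_transmit(state: str, packet_kind: str) -> bool:
    """Say whether a packet presented for transmission is sent or discarded."""
    enables = ENABLES_BY_STATE[state]
    if packet_kind == "smp":
        return enables.smp
    if packet_kind == "link":
        return enables.link_pkt
    if packet_kind == "data":
        return enables.data_pkt_xmit
    raise ValueError(f"unknown packet kind: {packet_kind}")
```

For example, a non-SMP data packet presented in LinkArm is discarded,
while an SMP presented in the same state is transmitted.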

ActiveTrigger - a device dependent trigger that initiates the transition
from LinkArm to LinkActive. For routers and channel adapters,
ActiveTrigger occurs upon reception of a non-VL15 packet which passes the
VCRC check on the port. For switches, ActiveTrigger occurs upon reception
of a non-VL15 packet which passes the VCRC check on any port of the
switch.

LinkDownTimeout - a timeout that indicates that the physical link has been
down (PhyLink = down) for a period of time that causes the port state
machine to transition to the LinkDown state. LinkDownTimeout occurs when
the port state machine has continuously been in the LinkActDefer state for
10ms +3%/-51%.

7.3 Packet Receiver States

C7-4: Whenever the physical link is up, the packet receiver shall process
the received stream from the physical layer as defined in Figure 51 Packet
Receiver State Machine on page 140.



The packet receiver monitors the received stream from the physical layer,
rcv_stream, and passes any packets received with proper delimiters and
no code violations to the link packet check or the data packet check as
appropriate. Each byte of the rcv_stream is tested once by the state
machine and causes at most one state transition. For example, when an SLP
causes a transition from RcvDataPacket to BadPacket, that SLP does not
cause a further transition from BadPacket to RcvLinkPacket.

While this logical state machine represents sending the whole packet to
the packet checker once the end delimiter is received, implementations
are allowed to begin processing the packet before that has occurred.
Switches and routers may begin to forward a data packet while in the
RcvDataPacket state if the packet passes all checks of the Data Packet
Check state machine which require discard of the packet on failure. The
required checks are all based on fields within the LRH. If further
processing of the packet results in a transition to the MarkedBadPacket or
BadPacket states and the switch or router has begun forwarding the packet,
the switch or router shall corrupt the packet.

C7-5: To corrupt a packet, a switch or router shall place the 1's
complement of the VCRC calculated for the transmitted packet in the VCRC
field and shall terminate the packet with the EBP delimiter.

o7-1: When corrupting a packet, the switch or router may truncate the
packet rather than sending all the received bytes.

C7-6: If a switch or router is forwarding a corrupted packet which is
longer than indicated by the packet length field of the LRH, then it shall
truncate the packet to less than or equal to the packet length field value.
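
The VCRC manipulation required by C7-5 is a bitwise inversion. The
following Python sketch is illustrative only: the function name is
hypothetical, and it assumes a 16-bit VCRC field, a width that this
section does not state.

```python
VCRC_MASK = 0xFFFF  # assumes a 16-bit VCRC field


def corrupted_vcrc(calculated_vcrc: int) -> int:
    """Return the 1's complement of the VCRC calculated for the packet."""
    return ~calculated_vcrc & VCRC_MASK
```

Because every bit is inverted, the value placed in the field can never
equal the correct VCRC, so a downstream receiver always sees a VCRC
mismatch in addition to the EBP delimiter.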

C7-7: A CA shall not deliver a received packet to its client unless it has
passed all the checks of the packet receiver and data packet check state
machines. Therefore, when the action in the state is "discard or corrupt,"
a CA shall discard the packet.

Packets received with one or more bytes of rcv_stream=error are
discarded. Packets received without proper start and end delimiters are
discarded. These packets indicate an error occurring on the local link and
cause entry to the bad packet state. Packets received with no bytes of
rcv_stream=error, a data packet start delimiter (SDP), and a bad packet
end delimiter (EBP) indicate a packet forwarded by a switch that
experienced an error that was not on the local link. These packets cause
entry to the marked bad pkt state. Since link packets are not forwarded by
switches and routers, they should never have a bad packet end delimiter.
A packet with a start delimiter of SLP and an end delimiter of EBP is
considered a local link error and causes entry to the bad packet state.
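
The delimiter rules above can be summarized as a small decision function.
This Python sketch is not the state machine of Figure 51; it only
classifies a completed packet. The function name and the returned labels
are hypothetical, and "EGP" is assumed here as the name of the good packet
end delimiter.

```python
def classify_received_packet(start: str, end: str, error_bytes: int) -> str:
    """Classify a packet by its delimiters and rcv_stream=error count."""
    if error_bytes > 0:
        return "BadPacket"  # local link error
    if start not in ("SDP", "SLP") or end not in ("EGP", "EBP"):
        return "BadPacket"  # improper start or end delimiter
    if end == "EBP":
        # A data packet marked bad by a forwarding switch is not a local
        # error; a link packet is never forwarded, so EBP on it is local.
        return "MarkedBadPacket" if start == "SDP" else "BadPacket"
    return "DataPacketCheck" if start == "SDP" else "LinkPacketCheck"
```

A caller would pass the delimiters observed on rcv_stream, for example
classify_received_packet("SDP", "EBP", 0) for a data packet that an
upstream switch corrupted.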

Figure 51 Packet Receiver State Machine

7.4 Data Packet Check

The data packet check machine in a CA verifies a data packet before
passing it to the network layer. The data packet check machine in a switch
or router port verifies a received data packet.

C7-8: Data packets shall be checked as specified by Figure 52 Data
Packet Check machine on page 142 and Section 7.4, "Data Packet
Check," on page 141. The order of checks within this state machine
indicates the precedence of the errors for reporting and not necessarily
the order in which the errors are detected.

For instance, most implementations would detect an invalid VL shortly
after the packet starts and a CRC error cannot be detected until the end
of the packet. However, CRC error is checked first in the state machine
because if both of these errors occur, the CRC error indicates that the
packet was damaged and that error should be reported rather than the VL
error.

C7-9: A switch or router shall perform the same checks as a CA on
packets for which the switch or router is the destination such as
management packets addressed to the switch or router.

The data packet check machine in a CA passes packets to the receiver
queueing. See Section 18.2.5.2 Receiver Queuing on page 828.

C7-10: If a packet fails any test that terminates in a state of the Data
Packet Check State Machine with the action "discard," switches, routers,
and CAs shall discard the packet.

C7-11: For packets that only fail tests terminating in states of the Data
Packet Check State Machine that specify the action of "corrupt or
discard," a CA shall discard the packet and a switch or router shall
discard the packet or corrupt it as defined in Section 7.3, "Packet
Receiver States," on page 138.
Figure 52 Data Packet Check machine

The link layer of a switch or router is only required to check ICRC on
packets that are destined to that switch or router. On all other packets,
a switch or router may omit the ICRC check by returning ICRC_check
= good without checking the ICRC.

xport

    IB      LNH indicates IB transport
    raw     LNH indicates raw transport

lver_check

    good    LVer equals 0x0
    bad     otherwise

d_length_check

    good    PktLen * 4 = received data bytes - 2 and
            (MTU + 124)/4 >= PktLen >= minimum length
    bad     otherwise

    Received length is the number of bytes between the SDP and EGP.
    MTU is PortInfo.MTUCap. Minimum length is 5 for raw packets and 6
    for IBA transport packets. See Section 7.7.8, "Packet Length (PktLen)
    - 11 bits," on page 160.

dlid_check

    valid   for CAs: DLID is a unicast LID of this CA or multicast LID
            configured for this CA
            for switches and routers: DLID is not 0x0000.
    invalid otherwise
cast)

GRH VL15 check

    valid   (VL <> 15) or (GRH is not present in the packet)

buffer

            buffer is available for the packet
    ovflow  otherwise

7.5 Link Packet Check

The only type of link packet currently defined is the flow control packet.
See Section 7.9.4, "Flow Control Packet," on page 176.

C7-12: A port shall verify a link packet as specified by Figure 53 Link
Packet Check machine on page 145 and Section 7.5, "Link Packet
Check," on page 144 before passing it to the flow control.
Figure 53 Link Packet Check machine

f_length_check

    good    length received = 6 bytes (including LPCRC)
    bad     otherwise
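
The two length checks (d_length_check in 7.4 and f_length_check above) are
simple arithmetic and can be sketched as follows. This is an illustration
in Python, not a normative implementation. The parameter names are
hypothetical, and the sketch assumes that the MTU is supplied as a byte
count rather than as the encoded PortInfo.MTUCap value.

```python
def d_length_check(pkt_len: int, received_bytes: int, mtu: int, raw: bool) -> bool:
    """Data packet length check; True stands for "good".

    pkt_len        -- PktLen field of the LRH, in 4-byte words
    received_bytes -- bytes counted between the start and end delimiters
    mtu            -- MTU in bytes (assumed already decoded)
    raw            -- True for raw packets, False for IBA transport packets
    """
    minimum = 5 if raw else 6
    if pkt_len * 4 != received_bytes - 2:
        return False
    # (MTU + 124)/4 >= PktLen, compared in bytes to avoid a division
    return minimum <= pkt_len and pkt_len * 4 <= mtu + 124


def f_length_check(received_bytes: int) -> bool:
    """Link packet length check: exactly 6 bytes including the LPCRC."""
    return received_bytes == 6
```

For instance, with a 256-byte MTU the largest acceptable PktLen is
(256 + 124) / 4 = 95 words, so a packet announcing 96 words fails the
check even when the received byte count is consistent with it.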



7.6 Virtual Lanes Mechanisms

Virtual lanes (VLs) provide a means to implement multiple logical flows
over a single physical link. Link level flow control can be applied to one
lane without affecting the others. Table 13 on page 146 summarizes the
key attributes of VLs.

C7-13: An InfiniBand protocol aware device shall conform to the
requirements defined by the rows labeled required VLs, buffering, and
ordering in Table 13.

o7-2: An InfiniBand protocol aware device that implements more than one
data VL shall conform to the requirements defined by the row labeled flow
control in Table 13.



Table 13: Key Virtual Lane Characteristics

Attribute       Description

VL              Represents a logical flow over a given physical link.

VL Types        There are two types of VLs, one for normal traffic called
                a data VL and one reserved for subnet management
                traffic. The subnet management traffic VL is VL15. All
                other VLs are for normal traffic.

Required VLs    VL 15 shall be implemented in all IBA channel adapters,
                switches, and routers.
                VL 0 shall be implemented for application use in all IBA
                channel adapters, switches, and routers.
                VLs 1-14 may be implemented to support additional
                traffic segregation. If implemented, VLs shall be
                numbered as indicated in Table 14 VL Numbering and
                Inter-operability on page 148.

Buffering       Devices shall provide independent buffering resources
                for each VL. See 7.6.4 Buffering and Flow Control For
                Data VLs on page 149 for details.

Flow Control    Link-level flow control shall be implemented on a per
                VL basis. See 7.9 Flow Control on page 175 for
                description of flow control on data VLs.
                VL 15 does not use link-level flow control, however
                see 7.6.3 Special VLs on page 148 for details.
                Flow control packets are not subject to flow control.

VL Field        4-bit field within the LRH indicating the actual VL being
                used by this packet.

SL Field        4-bit field located in the LRH indicating the requested
                service level within the local subnet.
                See 7.6.5 Service Level on page 150 for a description
                of this field.

Ordering        When fabric configuration is stable, unicast packets
                between the same source and destination LIDs within a
                subnet and using the same SL shall be ordered.
                Multicast packets shall also be similarly ordered. Note,
                however, that ordering is not guaranteed between unicast
                and multicast flows, even if on the same SL.
                Ordering is not maintained between different SLs.
                Packets on one SL may overtake packets on another
                SL, even if flowing through the same physical path
                within the fabric.



7.6.1 VL Identification

C7-14: The sending port of an InfiniBand protocol aware device shall
identify each packet with the virtual lane to be used, this information
being carried in the 4-bit VL field of the link header. In addition, the
local routing header shall contain a 4-bit Service Level (SL).

The use of the SL field is described in Section 7.6.5 on page 150.

7.6.2 Number of VLs supported

C7-15: An InfiniBand protocol aware device shall conform to requirements
defined by the rows labeled VL numbering and configuration in Table 14.

C7-16: All ports of an InfiniBand protocol aware device shall support
VL15. Further, all ports shall support data VL0.

o7-3: Ports may support more than one data VL. If they do, they shall do
so in accordance with the allowed number specified in Table 14 on page 148.
C7-17: The data VLs shall be numbered sequentially starting from zero.

Thus, if an implementation supports 4 data VLs, they shall be numbered
0, 1, 2 and 3.

Table 14 VL Numbering and Inter-operability

Number of Data    VL Numbering    List of VL Configurations
VLs Supported                     That Shall Be Supported (a)

1                 VL0             1
2                 VL0, VL1        2, 1
4                 VL0 - VL3       4, 2, 1
8                 VL0 - VL7       8, 4, 2, 1
15                VL0 - VL14      15, 8, 4, 2, 1

a. Because the port at the other end of the link may
   support a different number of VLs, the port must
   support operation with different numbers of VLs.

7.6.3 Special VLs

VL 15 is a special VL and must be supported by all ports. The following
lists the properties of VL 15:

C7-18: VL15 shall not be subject to flow control (both link level and end- 
to-end), i.e. VL 15 packets may be transmitted at any time. 

C7-19: InfiniBand protocol aware devices shall discard VL15 packets if
there is not enough room for reception. Other than the packet discard
counter (16.1.3.5 PortCounters on page 732) this discard is done silently.

C7-20: All InfiniBand protocol aware devices shall support sourcing and
sinking VL 15 packets.

C7-21: CAs and routers shall provide a minimum of a single packet buffer
for VL15 on each port for reception.

C7-22: Switches shall provide a minimum of a single packet buffer for 
VL15 per switch. 
C7-23: VL15 packets shall be scheduled preemptively, i.e. they are
transmitted ahead of all other packets (including flow control packets).

C7-24: VL mapping in a switch does not apply to VL15. That is, a packet
received by a switch on VL15 shall be transmitted on VL15 and no packet
received on another VL shall be transmitted on VL15.

C7-25: The SL field shall be set to 0 by devices sourcing VL15 packets
and ignored by devices checking and sinking VL15 packets.

C7-26: VL15 packets shall not be forwarded between subnets, i.e. a GRH
is not permitted on VL15 packets.

C7-27: Packets using VL15 shall have a maximum payload of 256 bytes.

7.6.4 Buffering and Flow Control For Data VLs

Virtual Lanes provide independent data streams on the same physical
link.

For data VLs, separate guaranteed buffering resources, and separate flow
control shall be provided. (For VL 15, different flow control and buffering
restrictions apply, and are described above.)

C7-28: For data VLs, each VL on each port shall provide the appearance
of separate buffering resources, i.e. although dedicated buffering
resources are not required, the ports must behave as if they were.

C7-29: Each port shall advertise the number of credits available for each
data VL configured using flow control packets.

These credit packets and the flow control process are described in 7.9
Flow Control on page 175.

Table 15 Processing of Link Packets on page 150 details the behavior of
a port when sending and receiving a link packet for a given data VL. The
following terminology is used in this table (and elsewhere in this
specification):

• A data VL is supported if its VL number is inside the range indicated
  by the PortInfo.VLCap attribute. This indicates that the data VL is
  supported by the port.

• A data VL is configured if its VL number is inside the range indicated
  by the PortInfo.OperationalVLs attribute. This indicates that the data
  VL is currently configured for use by the port.

(Refer to 14.2.5.6 PortInfo on page 633 for description of PortInfo.VLCap
and PortInfo.OperationalVLs.)
C7-30: Each port shall send and receive link packets as specified in Table
15 Processing of Link Packets on page 150.

Note, in this table, a required behavior has not been specified for the
cases where the data VL is supported but not currently configured. This is
done to support changing of the Data VL configuration. Note further, the
Data VL configuration may be changed in any PortState including
LinkActive.

Table 15 Processing of Link Packets

PortState        Status of a Data VL           Sending of Credits on         Receiving of Credits on
                                               that Data VL                  that Data VL

LinkInitialize   Data VL is configured         Shall send link packets       Shall be accepted
                                               for that Data VL

                 Data VL is supported but      May send link packets         Shall be ignored, no error
                 not currently configured      for that Data VL

                 Data VL is not supported      Shall not send link packets   Shall be ignored, no error
                                               for that Data VL

LinkArm or       Data VL is configured         Shall send link packets       Shall be accepted
LinkActive                                     on that Data VL

                 Data VL is supported but      Should not send link packets  Shall be ignored, no error
                 not currently configured      on that Data VL

                 Data VL is not supported      Shall not send link packets   Shall be discarded,
                                               on unsupported data VLs       malformed packet reported

C7-31: Each port shall provide sufficient buffering for each configured
data VL to be able to advertise credit for at least one packet with MTU
payload.

Note, MTU payload here refers to the lesser of MTUCap and
neighborMTU for that port.

(See 7.7.8 Packet Length (PktLen) - 11 bits on page 160 for definition of
the corresponding packet size requirement.)

C7-32: When a data packet arrives at a port, it shall be placed in the
buffer associated with that input port and VL field in the packet.

7.6.5 Service Level

Service Level (SL) is used to identify different flows within an IBA
subnet. It is carried in the local route header of the packet.
C7-33: The SL of a packet shall not be changed as a packet crosses the
subnet.

The SL is an indication as to the service class of the packet. IBA does not
assign any specific meaning to an SL value. SLs are intended as a
mechanism to aid in providing differentiated services, improved fabric
utilization and avoiding deadlock. However, the specifics of how this is
done are beyond the scope of this specification.

The IBA specification does, however, define two mechanisms using SLs
and VLs that are intended as tools to implement Quality of Service (QoS)
related services. One is SL-to-VL mapping, the other is data VL
arbitration. Both are described in detail below.

o7-4: If multiple data VLs are supported, then both SL-to-VL mapping and
data VL arbitration must be supported (both described below).

If only a single data VL is supported, then neither is required (although
SL-to-VL mapping may still be implemented for SL filtering; see 7.6.6 VL
Mapping Within a Subnet on page 152 for a description of this).

C7-34: The only requirement for devices supporting only a single data VL
with respect to SLs and VLs is that the device shall include the SL value
in the SL field when sourcing a packet into an IBA subnet.

Note that switches are included in this list because they can be the source
of packets via their SMI or GSI interfaces. Note also that this
specification does not require the validation of SL field at the packet
destination.

There are no ordering guarantees between packets of different SL.

The source for SL for different transport services is detailed in 9.10
Header and Data Field Source on page 359. For connected services
(unreliable connected, reliable connected and reliable datagrams), the SL
associated with the forward and reverse paths of the same connection may
be different (i.e. on the same connection, the SL associated with the
DeviceA:transmitWQ may be different from that for the
DeviceB:transmitWQ). For unreliable and raw datagrams, however, a node can
always respond to a datagram from some other node using the same SL as the
original datagram.

The SL used for a given destination (DLID), QOS, partition etc. is
ultimately provided by the subnet manager. It may also be from derived
sources such as request packets, local management agents etc.
7.6.6 VL Mapping Within A Subnet

As a packet is routed across a subnet, it may be necessary for it to change
VLs when it uses a given link. Examples of where this may be needed
include:

1) The link may not support the VL previously used by the packet. This
   could happen when a device in the fabric supports a limited set of
   VLs.

2) Two traffic streams arriving on different input ports of a switch may be
   using the same outgoing link, and may also happen to be using the
   same VL when arriving at the switch. If VL mapping were not
   supported, then both traffic streams would have to use the same VL on
   the output port. VL mapping allows these two streams to be assigned
   different VLs on the outgoing links. In general, VL mapping offers
   greater flexibility in maintaining independent traffic flows within a
   fabric.

SL to VL mapping is used to change VLs as a packet crosses a subnet.

SL to VL mapping is required in channel adapters, switches, and routers
that support more than one data VL. It is optional in those devices
supporting only one data VL. If it is implemented it shall be implemented
in accordance with the requirements of this section.

SL to VL mapping is done using a programmable mapping table. This is
provided by the SLtoVLMappingTable.

24 

o7-5: In channel adapters and routers that support SL to VL mapping, 25 
there shall be a logical table that maps the SL field in the packet LRH to 25 
the VL to be used for that output port. This table is 16 entries deep, with 
each port of the device having an independent table. All 16 possible 
values of SL shall be included in this table. The table indicates the VL 28 
number to be used by that packet when it is transmitted by the port. 29 

30 

o7-6: In switches that support SL to VL mapping, there shall be a logical 3^ 
table that maps the SL, input port and output port of the packet to the VL ^2 
to be used for the next hop. 

This table can be best viewed as a set of tables, one for each output port.
Each of these per output port tables then indicates which VL should be
used by the outgoing packet based on its SL field and the port that it
arrived on. Because the switch supports an internal port (refer to 18.2.4.1
Switch Ports on page 817) that will also source packets that require VL
mapping, this port is included as one of the input ports in the table.

o7-7: Thus, in switches that support SL to VL mapping, the overall
SLtoVLMappingTable shall be 16*(num_ports+1)*num_ports deep, where
num_ports is the number of external ports supported by the switch. All 16
possible values of SL shall be included in this table.

The table indicates the VL to be used by that packet for the next hop
transmission based on packet SL, input port and output port.

This table provides mapping for the n+1 input ports (including the internal
port) to n output ports.
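The table depth stated in o7-7 can be checked with a short sketch. The function names, the choice of input port index 0 for the internal port, and the flat index ordering are assumptions made for illustration; the text above only fixes the overall depth and the three lookup keys.

```python
SL_COUNT = 16  # the SL field is 4 bits wide, so there are 16 SL values


def switch_table_depth(num_ports: int) -> int:
    """Depth of the overall SLtoVLMappingTable for a switch, per o7-7."""
    return SL_COUNT * (num_ports + 1) * num_ports


def flat_index(sl: int, in_port: int, out_port: int, num_ports: int) -> int:
    """Hypothetical flat index into a table of that depth.

    Assumed numbering: in_port 0 is the internal port and 1..num_ports are
    the external ports; out_port is 1..num_ports.
    """
    if not 0 <= sl < SL_COUNT:
        raise ValueError("SL must be in 0..15")
    if not 0 <= in_port <= num_ports:
        raise ValueError("input port out of range")
    if not 1 <= out_port <= num_ports:
        raise ValueError("output port out of range")
    # One sub-table per output port, each holding (num_ports + 1) * 16 entries.
    return ((out_port - 1) * (num_ports + 1) + in_port) * SL_COUNT + sl


if __name__ == "__main__":
    print(switch_table_depth(8))  # 16 * 9 * 8 entries for an 8-port switch
```

Grouping by output port first mirrors the "set of tables, one for each output port" view described above; any other ordering of the three keys would give the same depth.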

Refer to Table 131 SLtoVLMappingTable on page 644 for details on the
SLtoVLMappingTable.

o7-8: Devices implementing SL to VL mapping shall behave as depicted
in Table 16.

Table 16 SLtoVLMappingTable Behavior

The number of VLs supported is defined by the PortInfo.VLCap attribute,
while the number configured is defined by the PortInfo.OperationalVLs
attribute. (Refer to 14.2.5.6 PortInfo on page 633 for description of
PortInfo.VLCap and PortInfo.OperationalVLs.)

Note, the SLtoVLMappingTable may be programmed with VL15 for any
SL that is not authorized to use that port (for channel adapters and
routers) or input-output port path (for switches). As indicated by the above
table, packets are discarded if the SLtoVLMappingTable returns VL15.
This filtering is intended as a mechanism to help protect against
unauthorized use of SLs, and to help in breaking routing dependency loops
(and thereby avoiding routing deadlocks).
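The lookup and discard behavior described for a channel adapter or router port can be sketched as follows. This is a minimal illustration, not a required implementation: the function name, the use of a Python list for the 16-entry table, and the use of None to signal "discard" are all assumptions.

```python
from typing import Optional, Sequence, Set

MANAGEMENT_VL = 15  # a table entry of VL15 marks an SL that may not use the port


def select_output_vl(sl_to_vl: Sequence[int], sl: int,
                     configured_data_vls: Set[int]) -> Optional[int]:
    """Return the VL to transmit on, or None if the packet is discarded.

    sl_to_vl is the port's 16-entry SL-to-VL table (hypothetical layout:
    index = SL, value = VL).
    """
    if len(sl_to_vl) != 16:
        raise ValueError("the table must cover all 16 SL values")
    vl = sl_to_vl[sl & 0xF]
    if vl == MANAGEMENT_VL:
        return None  # discard packet, no error
    if vl not in configured_data_vls:
        return None  # data VL not configured by the port: discard, no error
    return vl  # forward packet to the port using this VL


if __name__ == "__main__":
    table = [0] * 16
    table[3] = 1    # SL 3 is mapped to VL1
    table[5] = 15   # SL 5 is not authorized on this port
    print(select_output_vl(table, 3, {0, 1}))
    print(select_output_vl(table, 5, {0, 1}))
```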

In order to allow devices to be built with different numbers of VLs, the SM
must be able to configure the number of VLs to be used on a given link.
The SM can query each port to determine the number of VLs it supports
and then configure to a number supported by both ports on the link. Table
14 on page 148 depicts the number of VLs combinations that each device
must support. The number of VLs supported is defined by the
PortInfo.VLCap component while the number of VLs configured is defined by
the PortInfo.OperationalVL (Refer to 14.2.5.6 PortInfo on page 633).



VL Value in SLtoVLMappingTable    Action
------------------------------    -------------------------------
VL15                              Discard packet, no error
Data VL not configured by port    Discard packet, no error
Data VL configured by port        Forward packet to port using VL



7.6.7 Initialization and Configuration

Ports may be configured to 1, 2, 4, 8 or 15 VLs and must be configured to
a value equal to or less than the number supported. If an attempt is made
to program the OperationalVLs to a value larger than the VLCap, the port
may load OperationalVLs with any valid value.

A port must be configured with the same number of VLs for both its
sending and receiving directions.

Modification of the SLtoVLMappingTable may be made while the port is in
operation.



Table 17 Arbitration Rules for Devices with only one data VL

Packet type            Precedence order
-------------------    ----------------
VL15                   Highest
Flow control packet    2nd highest



o7-9: If a port implements SL-to-VL mapping, it shall not allow any packet
in transit to be fragmented as a result of changing the
SLtoVLMappingTable contents.

Packets may be discarded or mis-mapped during this change, however.

When a channel adapter, router, or switch initializes, the
SLtoVLMappingTable is not required to be initialized (i.e. the contents are
undefined). The table should be initialized by the SM prior to use by data
traffic.

7.6.8 VL Scheduling and Flow Control For VL15 and Flow Control Packets

VL15 (i.e. subnet management packets) traffic and flow control packets
will use preemptive scheduling. The order of precedence is depicted in
Table 17.

7.6.9 VL Arbitration and Prioritization

VL arbitration refers to the arbitration done for an outgoing link on a
switch, router or channel adapter. Each output port has a separate arbiter.
The arbiter selects the next packet to transmit from the set of candidate
packets available for transmission on that port.

C7-35: The arbiter shall not violate packet ordering rules, i.e. packets on
a given VL shall not be reordered.

The following describes the algorithm to be used by the VL arbiter.

7.6.9.1 VL Arbitration When Only One Data VL Is Implemented

Table 17 depicts the arbitration rules for switch, router or channel adapters
that implement only a single data VL. This is a simple priority scheme
where all packets at a precedence level are sent before any packets at a
lower precedence level.

Table 17 Arbitration Rules for Devices with only one data VL (continued)

Packet type    Precedence order
-----------    ----------------
VL0            Lowest

o7-10: Devices implementing only a single data VL shall transmit packets
on their output ports using the arbitration rules depicted in Table 17
Arbitration Rules for Devices with only one data VL on page 154.

7.6.9.2 VL Arbitration When Multiple Data VLs Are Implemented

The implementation of multiple data VLs is an optional feature in IBA. If
they are implemented, however, the implementation shall conform to the
specification detailed in this section.

o7-11: For devices implementing more than one data VL, the transmission
of VL15 packets and flow control packets shall be the same as depicted
in Table 17 on page 154 except that here all the data VLs are at a
lower priority than VL15 (highest) and flow control packets (second
highest).

o7-12: Devices implementing more than one data VL shall also implement
the algorithm described in Section 7.6.9.2 for arbitrating between packets
on the data VLs.

A two level scheme is employed, using preemptive scheduling layered on
top of a weighted fair scheme. Additionally, the scheme provides a
method to ensure forward progress on the low-priority VLs. The weighting,
prioritization, and minimum forward progress bandwidth are programmable.



VL arbitration is controlled by the VLArbitrationTable (refer to 14.2.5.9
VLArbitrationTable on page 644). This table shall consist of three
components: High-Priority, Low-Priority and Limit of High-Priority. The
High-Priority and Low-Priority components are each a list of VL/Weight
pairs. The High-Priority list shall have a minimum length of one and a
maximum length of 64. The Low-Priority list shall have a minimum length
equal to the number of data VLs supported and a maximum length of 64. The
High-Priority and Low-Priority component lists are allowed to be of
different length.

Each list entry shall contain (1) a VL number (values from 0-14), and (2)
a weighting value (values 0-255), indicating the number of 64 byte units
which may be transmitted from that VL when its turn in the arbitration
occurs. The PktLen field in the LRH is used to determine the number of units
in the packet. (Note, the VCRC and also the symbols between packets
introduced by the physical layer should not be included in VL arbitration
weight calculations.) The calculation shall be maintained to 4 byte
increments.

A weight of 0 indicates that this entry should be skipped.

If a list entry is programmed for VL15 or for a VL that is not supported or
is not currently configured by the port, the port may either skip that entry
or send from any supported VL for that entry.

Each configured data VL should be listed in at least one of the component
lists. There is, however, no requirement for a device to check for this case.
Should a configured data VL not appear in either component list, packets
for this data VL may be dropped, sent when the arbiter has no packets to
send or never sent.

Note that the same data VL may be listed multiple times in the High or
Low-Priority component list, and, further, it can be listed in both lists.



7.6.9.2.2 Arbitration Rules for Data VL Packets

The Limit of High-Priority component indicates the amount of high-priority
packets that can be transmitted without an opportunity to send a low
priority packet. Specifically, the number of bytes that can be sent is Limit
of High-Priority times 4K bytes, with the counting done the same as
described above for weights (i.e. the calculation is done to 4 byte
increments and a High-Priority packet can be sent if the current byte count
has not exceeded the Limit of High-Priority). A value of 255 indicates that
the byte limit is unbounded. (Note, if the 255 value is used, forward progress
of low priority packets is not guaranteed by this arbitration scheme.) A
value of 0 indicates that only a single packet from the high-priority table
may be sent before an opportunity is given to the low-priority table.

The VLArbitrationTable may be modified when the port is active. This
modification shall not result in fragmentation of any packet that is in
transit. Arbitration rules may be violated during this change, however.

When a channel adapter, router, or switch initializes, the
VLArbitrationTable is not required to be initialized (i.e. the contents are
undefined). The table should be initialized by the SM prior to use by data
traffic.

7.6.9.2.1 Arbitration Rules Between VL15, Link Control and Data VL Packets

The rules of Table 17 on page 154 apply, where the data VLs (VL0-VL14)
have the lowest priority.



When there are no VL15 or Flow Control packets to send, the arbitration
rules in this section apply.

7.6.9.2.3 Arbitration Rules Between High and Low Priority Components

The High-Priority and Low-Priority components form a two level priority
scheme. Each of these components (or tables) may have a packet
available for transmission. A packet is available for transmission from the
High Priority table if the following test succeeds:

1) the VL field matches that of any packets that are currently waiting for
transmission for this port AND

3) there is available credit to send that packet

An entry with 0 weight is considered not in the list.

Note, implementations may check if HiPriAvailWeight is available in
determining if a packet is available.

Upon completion of transmission of a packet the following test should be
done to determine which table to use to transmit the next packet:

If the High-Priority table has an available packet for transmission (as
defined above) and the HighPriCounter has not expired, then the
High-Priority table is said to be active and a packet may be sent from the
High-Priority table.

If the High-Priority table does not have an available packet for
transmission (as defined above), or if the HighPriCounter has expired, then
the HighPriCounter shall be reset, the Low-Priority table is said to be active
and a packet may be sent from the Low-Priority table.

The following rules govern the operation of the HighPriCounter:

1) The HighPriCounter expires when its current value is negative.

2) If the value in the Limit of High-Priority component is not 255, then for
each High-Priority packet transmitted, the size of the packet (as
defined by the PktLen field in the LRH) is deducted from the current
value of the HighPriCounter. The calculation should be maintained to
4 byte increments.

3) When the HighPriCounter is reset, the value in the Limit of
High-Priority component times 4K bytes is loaded into the HighPriCounter.
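The three HighPriCounter rules can be sketched as a small class. This is an illustrative model only: the class and method names are hypothetical, "4K bytes" is read as 4096 bytes, and PktLen is taken as a count of 4 byte words as defined for the LRH.

```python
class HighPriCounter:
    """Sketch of the HighPriCounter rules; names and structure are assumptions."""

    UNBOUNDED = 255          # Limit of High-Priority value meaning "no byte limit"
    BYTES_PER_UNIT = 4096    # "4K bytes", assumed to mean 4096

    def __init__(self, limit_of_high_priority: int) -> None:
        if not 0 <= limit_of_high_priority <= 255:
            raise ValueError("Limit of High-Priority must be in 0..255")
        self.limit = limit_of_high_priority
        self.value = 0
        self.reset()

    def reset(self) -> None:
        # Rule 3: load Limit of High-Priority times 4K bytes.
        self.value = self.limit * self.BYTES_PER_UNIT

    def expired(self) -> bool:
        # Rule 1: the counter expires when its current value is negative.
        return self.value < 0

    def charge(self, pkt_len_words: int) -> None:
        # Rule 2: deduct the packet size unless the limit is 255 (unbounded).
        if self.limit != self.UNBOUNDED:
            self.value -= pkt_len_words * 4


if __name__ == "__main__":
    counter = HighPriCounter(0)
    print(counter.expired())   # a limit of 0 still allows one packet
    counter.charge(14)
    print(counter.expired())   # after that packet the counter has expired
```

With a limit of 0 the counter starts at zero, which is not negative, so exactly one high-priority packet can be sent before the low-priority table gets its opportunity, matching the description of the value 0 given earlier.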

7.6.9.2.4 Arbitration Rules Within the High and Low components

Within each High or Low Priority table, weighted fair arbitration is used,
with the order of entries in each table specifying the order of VL
scheduling, and the weighting value specifying the amount of bandwidth
allocated to that entry. Each entry in the table is processed in order.

7.7 Local Route Header



A separate pointer and available weight count is maintained for each of 
the two tables. The pointers identify the current entry in the table, while the 
available weight count indicates the amount of weight that the current 
entry has available for data packet transmission. When a table is active 
(as defined in the previous section), the current entry in the table is in- 
spected. A packet corresponding to this entry will be sent to the output 
port for transmission and the packet size (in 4 byte increments) will be de- 
ducted from the available weight count for the current entry, if all of the fol- 
lowing are true: 

1 ) The available weight for the list entry is positive. 

2) There is a packet available for the VL of the entry 

3) Buffer credit is available for this packet. 

Note, if the available weight at the start of a new packet is positive, condi- 
tion 1 above is satisfied, even if the packet is larger than the available 
weight. 

When any of these conditions is not true, the next entry in the table is in- 
spected. The current pointer is moved to the next entry in the table, the 
available weight count is set to the weighting value for this new entry, and 
the above test repeated. This is repeated until a packet is found that can 
be sent to the port for transmission. If the entire table is checked and no 
entry can be found satisfying the above criteria, the other table becomes 
active. 
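The pointer and available-weight logic for a single table can be sketched as below. This is a logical model under stated assumptions, not a required design: the class name, the per-VL queues holding PktLen values, the has_credit callback, and loading the first entry's weight at construction are all hypothetical. Weights are converted to bytes (64 byte units) and packet sizes are PktLen times 4 bytes.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


class TableArbiter:
    """Weighted arbitration within one (High or Low Priority) table."""

    def __init__(self, entries: List[Tuple[int, int]]) -> None:
        # entries: non-empty list of (vl, weight) pairs in table order
        self.entries = list(entries)
        self.pointer = 0
        self.available = self.entries[0][1] * 64  # assumed initial state

    def next_packet(self, queues: Dict[int, Deque[int]],
                    has_credit: Callable[[int, int], bool]
                    ) -> Optional[Tuple[int, int]]:
        """Return (vl, pkt_len) of the packet to send, or None.

        None means the whole table was checked without success, in which
        case the other table would become active.
        """
        # Current entry, every other entry, then the current entry again
        # with its weight reloaded.
        for _ in range(len(self.entries) + 1):
            vl, _weight = self.entries[self.pointer]
            queue = queues.get(vl)
            if self.available > 0 and queue and has_credit(vl, queue[0]):
                pkt_len = queue.popleft()
                # Sent even if larger than the remaining weight.
                self.available -= pkt_len * 4
                return vl, pkt_len
            # Move to the next entry and load its weighting value.
            self.pointer = (self.pointer + 1) % len(self.entries)
            self.available = self.entries[self.pointer][1] * 64
        return None


if __name__ == "__main__":
    arbiter = TableArbiter([(0, 1), (1, 2)])
    queues = {0: deque([16, 16, 16]), 1: deque([16])}
    always = lambda vl, pkt_len: True
    print([arbiter.next_packet(queues, always) for _ in range(5)])
```

An entry with a weight of 0 is skipped naturally, because its available weight is never positive.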

This description depicts the logical flow of the arbitration process, but 
does not specify performance requirements. Implementations shall per- 
form in a logically consistent manner with the above description. Imple- 
mentations may process steps in parallel and may pipeline tests. As an 
example of pipelining of tests, the check that there be available packets 
may return false if a packet has just recently been forwarded to output port 
but the arbiter logic has not processed its arrival. 

Further, implementations are not required to implement the pointers, 
available weight counter and HighPriCounter. They must, however, be- 
have in a manner equivalent to that described in this section. 



Local Routing Header - LRH - 8 bytes 

The Local Routing Header (LRH) contains the fields for local routing by
switches within an IBA subnet. The LRH is at the start of every packet and
the packet ends with the Variant CRC. The LRH is 8 bytes long. For
additional information on overall packet layout, see Chapter 5: Data Packet
Format on page 117.

Figure 54 Local Route Header (LRH)

bytes   bits 31-24    bits 23-16         bits 15-8     bits 7-0
-----   ----------    ---------------    ----------------------------
0-3     VL | LVer     SL | Rsv2 | LNH    Destination Local Identifier
4-7     Reserve 5 | Packet Length (11 bits)    Source Local Identifier



C7-36: The LRH shall use the format specified in Figure 54 Local Route 
Header (LRH) on page 159 . 
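The layout in Figure 54 can be sketched as a packing routine. The function name and argument order are hypothetical; the bit positions follow the figure, with reserved bits transmitted as zero. The example values are those of the local sample packet in Table 20, and the result matches bytes 0 through 7 of Table 24.

```python
import struct


def pack_lrh(vl: int, lver: int, sl: int, lnh: int,
             dlid: int, pkt_len: int, slid: int) -> bytes:
    """Pack the 8 byte LRH in big endian byte order (illustrative sketch)."""
    byte0 = ((vl & 0xF) << 4) | (lver & 0xF)   # VL (4 bits), LVer (4 bits)
    byte1 = ((sl & 0xF) << 4) | (lnh & 0x3)    # SL (4), Reserve (2) = 00, LNH (2)
    return struct.pack(
        ">BBHHH",
        byte0,
        byte1,
        dlid & 0xFFFF,       # Destination Local Identifier (16 bits)
        pkt_len & 0x7FF,     # Reserve (5 bits) = 00000, PktLen (11 bits)
        slid & 0xFFFF,       # Source Local Identifier (16 bits)
    )


if __name__ == "__main__":
    lrh = pack_lrh(vl=0x7, lver=0x0, sl=0x1, lnh=0x2,
                   dlid=0x375C, pkt_len=0xE, slid=0x17D2)
    print(lrh.hex())  # 7012375c000e17d2
```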

7.7.1 Virtual Lane (VL) - 4 bits 

Specifies a virtual lane to be used for a packet. This field identifies which 
receive buffer and which receive flow control credits should be used for 
the received packet. 

C7-37: The VL field shall be set to the VL on which the packet is sent. 

The virtual lane can change from link to link in a subnet. Since the Virtual
Lane can change, the Link Virtual Lane is not included in the Invariant
CRC field.

7.7.2 Link Version (LVer) - 4 bits 

Specifies the version of the Local Routing Header used for this packet. 
This version applies to the general packet structure including the LRH 
fields and the variant CRC. 

C7-38: The LVer field shall be set to 0x0.

If a receiving device does not support the Link Version specified then the 
packet is discarded. 

7.7.3 Service Level (SL) - 4 bits 

The Service Level field. This field is used by switches to determine the Vir- 
tual Lane used for this packet. This is described in Section 7.6.5 on page 
150. 



7.7.4 Reserve - 2 bits

C7-39: The 2-bit Reserve field shall be transmitted as 00 and shall be
ignored on receive.

7.7.5 Link Next Header (LNH) - 2 bits

Specifies what headers follow the Local Routing Header. The first bit
(msb) indicates whether the packet uses IBA transport. The second bit
(lsb) indicates whether a GRH/IPv6 header is present.

Table 18 Link Next Header Definition

Packet Type               LNH bit 1          LNH bit 0            Transport    Next Header
                          (IBA Transport)    (GRH/IPv6 header)
----------------------    ---------------    -----------------    ---------    ---------------
IBA global                1                  1                    IBA          GRH
IBA local                 1                  0                    IBA          BTH
IP - non-IBA transport    0                  1                    Raw          GRH
Raw                       0                  0                    Raw          RWH (Ethertype)


C7-40: The LNH field shall indicate the packet type of the following packet 
as defined by Table 18 Link Next Header Definition on page 160 . 
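Table 18 amounts to a four-entry lookup on the 2-bit LNH value. A minimal sketch, with a hypothetical function name and tuple layout:

```python
from typing import Dict, Tuple

# LNH value -> (packet type, transport, next header), transcribed from Table 18.
LNH_TABLE: Dict[int, Tuple[str, str, str]] = {
    0b11: ("IBA global", "IBA", "GRH"),
    0b10: ("IBA local", "IBA", "BTH"),
    0b01: ("IP - non-IBA transport", "Raw", "GRH"),
    0b00: ("Raw", "Raw", "RWH (Ethertype)"),
}


def decode_lnh(lnh: int) -> Tuple[str, str, str]:
    """Return (packet type, transport, next header) for a 2-bit LNH value."""
    return LNH_TABLE[lnh & 0x3]


if __name__ == "__main__":
    print(decode_lnh(0x2))  # the local sample packet uses LNH = 0x2
```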

7.7.6 Destination Local Identifier (DLID) - 16 bits 

Specifies the LID of the port to which the subnet delivers the packet. LIDs 
are unique within a subnet. More specifically it identifies the route to take 
to the destination port. If the packet is to be routed to another subnet, then 
this is the LID of the Router. 



7.7.7 Reserve - 5 bits 



C7-41: The 5 bit reserve field shall be transmitted as 00000 and shall be 
ignored on receive. 



7.7.8 Packet Length (PktLen) - 11 bits 

The number of 4 byte words contained in the packet. 

C7-42: The value of the PktLen field shall equal the number of bytes in all 
the fields starting with the first byte of the Local Route Header and the last 
byte before the Variant CRC, inclusive, divided by 4. 

The maximum allowable size of all headers plus the CRC fields is 126
bytes. The maximum value of this field is (4096 + 126 - 2)/4 = 4220/4 =
1055, reflecting a maximum of 126 bytes for all headers and CRCs minus
the uncounted variant CRC.

Table 19 Packet Size

MTU     Maximum Packet Length (Bytes/4)    Maximum Bytes (MTU+126)
----    -------------------------------    -----------------------
256     95                                 382
512     159                                638
1024    287                                1150
2048    543                                2174
4096    1055                               4222
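The entries of Table 19 follow from the arithmetic given for the 4096 byte MTU. A short sketch that reproduces them, with hypothetical constant and function names:

```python
MAX_HEADER_AND_CRC_BYTES = 126  # all headers plus the CRC fields
VCRC_BYTES = 2                  # the Variant CRC is not counted in PktLen


def max_packet_bytes(mtu: int) -> int:
    """Maximum bytes on the wire for a given MTU (MTU + 126)."""
    return mtu + MAX_HEADER_AND_CRC_BYTES


def max_pkt_len(mtu: int) -> int:
    """Maximum PktLen value, in 4 byte words, for a given MTU."""
    return (max_packet_bytes(mtu) - VCRC_BYTES) // 4


if __name__ == "__main__":
    for mtu in (256, 512, 1024, 2048, 4096):
        print(mtu, max_pkt_len(mtu), max_packet_bytes(mtu))
```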



C7-43: For packets with IBA transport, the smallest allowed value for 
Packet Length is 6 (24 Bytes) including LRH. 

C7-44: For raw packets, the smallest allowed value for Packet Length is 
5 (20 Bytes) including LRH. 

C7-45: The maximum allowed value for Packet Length is the value shown 
in Table 19 Packet Size on page 161 for the smaller of MTUCap and 
NeighborMTU. 



7.7.9 Source Local Identifier (SLID) - 16 bits

C7-46: SLID shall be the LID of the port which injected the packet onto the
subnet.

The subnet manager assigns each node a LID which is unique within a
subnet.

7.8 CRCs

7.8.1 Invariant CRC (ICRC) - 4 Bytes

Specifies a Cyclic Redundancy Code covering all the fields of the Packet 
which are invariant from end to end through all switches and routers on 
the network. This field is present in all IBA packets but is NOT present in 
Raw Packets because for raw packets it is not known which fields will be 
invariant. The CRC calculation is re-started with each packet in the mes- 
sage. Which header fields that are included depends on whether the 
Global Routing Header is present because the router may modify addi- 
tional header fields. 

C7-47: The ICRC field shall be present in all IBA transport packets.

C7-48: The ICRC field shall be calculated as specified in Section 7.8.1.
"Invariant CRC (ICRC) - 4 Bytes." on page 161.

If the packet is local to the subnet (the Global Routing Header is not
present), then the ICRC calculation is as follows:

• With no GRH, the ICRC includes:

  • Local Routing Header: except for the VL.

  • Base Transport Header: except for the Resv8a field

  • Extension Transport Headers (if present),

  • Packet Payload (if present),

• With no GRH, the ICRC excludes: (these fields are replaced with 1s
  for the ICRC calculation)

  • Local Routing Header: VL,

  • Base Transport Header: Resv8a.

If the packet is routed between subnets, so the Global Route Header is
present, the ICRC calculation is as follows:

• With a GRH, the ICRC includes:

  • Global Routing Header: Version, Payload length, Next Header,
    Source IPV6 address, and Destination IPV6 address

  • Base Transport Header, except for the Resv8a field.

  • Extension Transport Headers (if present),

  • Packet Payload (if present).

• With a GRH, the ICRC excludes: (these fields are replaced with 1's
  for the CRC calculation)

  • Local Routing Header, all fields,

  • Global Routing Header: Flow label, Traffic Class, and Hop Limit
    fields.

  • Base Transport Header: Resv8a.

All fields in the packet, including those excluded from the Invariant CRC,
are protected by the Variant CRC described in the next section.

32 

The polynomial used is the same CRC-32 used by Ethernet -
0x04C11DB7. The procedure for the calculation is:

1) The initial value of the CRC-32 calculation is 0xFFFFFFFF.

2) The CRC calculation is done in big endian byte order with the least
significant bit of the most significant byte being the first bits in the
CRC calculation.

3) The bit sequence from the calculation is complemented and the
result is the ICRC.

4) The resulting bits are sent in order from the bit representing the
coefficient of the highest term of the remainder polynomial. The least
significant bit, most significant byte first ordering of the packet does not
apply to the ICRC field.

The CRC always starts with LRH:LVer bit 0, whether GRH is present or
not.

This bit and byte ordering is consistent with Ethernet's CRC calculation.
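Since the text states that the polynomial, seed, complement and bit ordering are consistent with Ethernet's CRC, the core calculation can be sketched with the usual Ethernet-style CRC-32. The function name is hypothetical. Processing each byte least significant bit first with polynomial 0x04C11DB7 is equivalent to shifting right with the bit-reversed constant 0xEDB88320. How the 32-bit result is laid out in the ICRC field on the wire is governed by step 4 and is not modeled here.

```python
import zlib


def crc32_ethernet(data: bytes) -> int:
    """Bitwise Ethernet-style CRC-32 (illustrative sketch).

    Seed 0xFFFFFFFF, each byte consumed least significant bit first,
    final value complemented.
    """
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320  # bit-reversed 0x04C11DB7
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    message = b"123456789"
    print(hex(crc32_ethernet(message)))             # 0xcbf43926
    print(crc32_ethernet(message) == zlib.crc32(message))
```

The comparison against zlib.crc32 is only a cross-check of the sketch against a widely used CRC-32 implementation.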
Figure 55 CRC Calculation Order

7.8.2 Variant CRC (VCRC) - 2 Bytes

Specifies a Cyclic Redundancy Code covering all fields of the Packet.
This field is present in all data packets including Raw Packets and
includes all bytes from the first byte of the LRH to the last byte before the
Variant CRC, inclusive. Since a number of these fields can change as the
packet is processed by switches and routers, the Variant CRC may have
to be regenerated at each Link through the subnet. If a switch does not
change any fields including the Link Virtual Lane, then the Variant CRC
does not have to be regenerated.

C7-49: The VCRC field shall be present in all data packets.

C7-50: The VCRC field shall be calculated as specified in Section 7.8.2.
"Variant CRC (VCRC) - 2 Bytes." on page 163.

The polynomial used is the same CRC-16 used by HIPPI-6400 - 0x100B.
The procedure for the calculation is:

1) The initial value of the CRC-16 calculation is 0xFFFF.

2) The CRC calculation is done in big endian byte order with the least
significant bit of the first byte of the Local Route Header (bit 0 of
LRH:LVer) being the first bit in the CRC calculation.

3) The bit sequence from the calculation is complemented and the
result is the VCRC.

4) The resulting bits are sent in order from the bit representing the
coefficient of the highest term of the remainder polynomial. The least
significant bit, most significant byte first ordering of the packet does not
apply to the VCRC field.

This bit and byte ordering is consistent with Ethernet's CRC calculation.

7.8.3 Link Packet CRC (LPCRC) - 2 Bytes

Specifies a Cyclic Redundancy Code covering all fields of the Link Packet.
This field is present in all Link packets including Flow Control Link Packets
and includes all bytes from the first byte of the Opcode to the last byte
before the LPCRC, inclusive. This field is always computed for each Link
packet.

C7-51: The LPCRC field shall be present in all link packets.

C7-52: The LPCRC field shall be calculated as specified in Section 7.8.3.
"Link Packet CRC (LPCRC) - 2 Bytes." on page 164.

1) The initial value of the CRC-16 calculation is 0xFFFF.

2) The CRC calculation is done in big endian byte order with the least
significant bit of the first byte of the Local Route Header (bit 0 of
LRH:LVer) being the first bit in the CRC calculation.

3) The bit sequence from the calculation is complemented and the
result is the LPCRC.

4) The resulting bits are sent in order from the bit representing the
coefficient of the highest term of the remainder polynomial. The least
significant bit, most significant byte first ordering of the packet does not
apply to the LPCRC field.

This bit and byte ordering is consistent with Ethernet's CRC calculation.

7.8.4 CRC Calculation Samples

The following is an example of CRC calculation. The requirements for the
CRC calculation are specified above; this section is intended for
informative purposes only.

7.8.4.1 ICRC Generator

The polynomial used for ICRC calculation is 0x04C11DB7. The seed
value is 0xFFFFFFFF. The ICRC Generator Remainder is the complement
of the resulting calculation.

The ICRC Generator actual implementation is not specified. The diagram
in Figure 56 is provided as a reference with the sole purpose of clarifying
the calculation details and does not imply a required implementation.



Figure 56 ICRC Generator

The 32 Flip-Flops are initialized to 1 before every ICRC generation. The
packet is fed to the reference design of Figure 56 one bit at a time in the
order specified in Section 7.8.1 on page 161. The Remainder is the
bitwise NOT of the value stored at the 32 Flip-Flops after the last bit of the
packet was clocked into the ICRC Generator. ICRC Field is obtained from
the Remainder as shown in Figure 56. ICRC Field is transmitted using
Big Endian byte ordering like every field of an InfiniBand packet.



7.8.4.2 VCRC Generator

The polynomial used for VCRC and FCCRC calculation is 0x100B. The
seed value is 0xFFFF. The VCRC/FCCRC Generator Remainder is the
complement of the resulting calculation.

The VCRC and FCCRC are generated in the same manner as described
above for the ICRC. Figure 57 shows the reference design for the VCRC
/ FCCRC Generator.



Figure 57 VCRC / FCCRC Generator

7.8.4.3 Sample Packets

7.8.4.3.1 Local Packet Example

Figure 58 shows the structure of the local packet used for the example.
The packet is an RDMA Write Only carrying a payload of 14 bytes.



Figure 58 Local Packet Example

The header values for the sample packet are shown in Table 20, Table 21
and Table 22 respectively. The data payload is shown in Table 23.

Table 20 LRH

Field     Value
------    ------
VL        0x7
LVer      0x0
SL        0x1
LNH       0x2
DLID      0x375C
PktLen    0xE
SLID      0x17D2


Table 21 BTH

Field      Value
-------    --------
Opcode     0x0A
SE         0x0
M          0x0
Pad        0x2
TVer       0x0
PKey       0x2487
Dest QP    0x87B1B3
AckReq     0x0
PSN        0x0DEC2A


Table 22 RETH

Field         Value
----------    ------------------
VA            0x01710A1C015D4002
RKey          0x38F27A05
DMA Length    0x0000000E



Table 23 Payload

Byte    Value
----    -----
0       0xBB
1       0x88
2       0x4D
3       0x85
4       0xFD
5       0x5C
6       0xFB
7       0xA4
8       0x72
9       0x8B
10      0xC0
11      0x69
12      0x0E
13      0xD4

The combined byte stream for the Local Packet (before ICRC and VCRC)
is shown in Table 24.

Table 24 Local Packet Byte Stream (before ICRC and VCRC)

Byte  Value    Byte  Value    Byte  Value    Byte  Value
----  -----    ----  -----    ----  -----    ----  -----
0     0x70     15    0xB3     30    0x7A     45    0x8B
1     0x12     16    0x00     31    0x05     46    0xC0
2     0x37     17    0x0D     32    0x00     47    0x69
3     0x5C     18    0xEC     33    0x00     48    0x0E
4     0x00     19    0x2A     34    0x00     49    0xD4
5     0x0E     20    0x01     35    0x0E     50    0x00
6     0x17     21    0x71     36    0xBB     51    0x00
7     0xD2     22    0x0A     37    0x88
8     0x0A     23    0x1C     38    0x4D
9     0x20     24    0x01     39    0x85
10    0x24     25    0x5D     40    0xFD
11    0x87     26    0x40     41    0x5C
12    0x00     27    0x02     42    0xFB
13    0x87     28    0x38     43    0xA4
14    0xB1     29    0xF2     44    0x72



Table 25 shows the masked byte stream used for ICRC calculation. 



Table 25 Masked Byte Stream for ICRC Calculation

Byte  Value    Byte  Value    Byte  Value    Byte  Value
----  -----    ----  -----    ----  -----    ----  -----
0     0xF0     15    0xB3     30    0x7A     45    0x8B
1     0x12     16    0x00     31    0x05     46    0xC0
2     0x37     17    0x0D     32    0x00     47    0x69
3     0x5C     18    0xEC     33    0x00     48    0x0E
4     0x00     19    0x2A     34    0x00     49    0xD4
5     0x0E     20    0x01     35    0x0E     50    0x00
6     0x17     21    0x71     36    0xBB     51    0x00
7     0xD2     22    0x0A     37    0x88
8     0x0A     23    0x1C     38    0x4D
9     0x20     24    0x01     39    0x85
10    0x24     25    0x5D     40    0xFD
11    0x87     26    0x40     41    0x5C
12    0xFF     27    0x02     42    0xFB
13    0x87     28    0x38     43    0xA4
14    0xB1     29    0xF2     44    0x72

Generated ICRC is: 0x9625B75A
Generated VCRC is: 0x45FA
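The step from the unmasked byte stream to the masked one used for the ICRC can be sketched for a local packet (no GRH). The function name is hypothetical, and the byte offsets assume the 8 byte LRH is followed directly by the BTH, whose fifth byte is the Resv8a field. The sketch only performs the masking; it does not compute the ICRC or VCRC values.

```python
def mask_local_packet_for_icrc(packet: bytes) -> bytes:
    """Replace the ICRC-excluded fields of a local packet with 1s."""
    if len(packet) < 13:
        raise ValueError("packet too short to contain LRH and BTH Resv8a")
    masked = bytearray(packet)
    masked[0] |= 0xF0   # LRH VL: upper 4 bits of byte 0
    masked[12] = 0xFF   # BTH Resv8a: LRH (8 bytes) + BTH offset 4
    return bytes(masked)


if __name__ == "__main__":
    # LRH and the first 8 bytes of the BTH of the local sample packet.
    start = bytes.fromhex("7012375c000e17d20a2024870087b1b3")
    print(mask_local_packet_for_icrc(start).hex())
    # f012375c000e17d20a202487ff87b1b3
```

For the sample packet this changes byte 0 from 0x70 to 0xF0 and byte 12 from 0x00 to 0xFF, the two differences between the unmasked and masked byte streams.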

Table 26 shows the complete Local Packet Byte Stream.

Table 26 Local Packet Byte Stream

Byte  Value    Byte  Value    Byte  Value    Byte  Value
----  -----    ----  -----    ----  -----    ----  -----
0     0x70     15    0xB3     30    0x7A     45    0x8B
1     0x12     16    0x00     31    0x05     46    0xC0
2     0x37     17    0x0D     32    0x00     47    0x69
3     0x5C     18    0xEC     33    0x00     48    0x0E
4     0x00     19    0x2A     34    0x00     49    0xD4
5     0x0E     20    0x01     35    0x0E     50    0x00
6     0x17     21    0x71     36    0xBB     51    0x00
7     0xD2     22    0x0A     37    0x88     52    0x96
8     0x0A     23    0x1C     38    0x4D     53    0x25
9     0x20     24    0x01     39    0x85     54    0xB7
10    0x24     25    0x5D     40    0xFD     55    0x5A
11    0x87     26    0x40     41    0x5C     56    0x45
12    0x00     27    0x02     42    0xFB     57    0xFA
13    0x87     28    0x38     43    0xA4
14    0xB1     29    0xF2     44    0x72



7.8.4.3.2 Global Packet Example 



Figure 59 shows the structure of the Global packet used for the example. 



Figure 59 Global Packet Example

The BTH, RETH and data payload for the Global example packet are the
same as for the Local packet. The values for the LRH and GRH fields
are shown in Table 27 and Table 28.



Table 27 LRH 



Field    Value
VL       0x7
LVer     0x0
SL       0x1
LNH      0x3
DLID     0x375C
PktLen   0x18
SLID     0x17D2



Table 28 GRH 


Field      Value
IPVer      0x6
TClass     0x00
FlowLabel  0x00000
PayLen     0x0032
NxtHdr     0x00
HopLmt     0x10
SGID       0x00000000000001250000000000000026
DGID       0x00000000000001170000000000000096
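
As a cross-check of Table 27 against the first eight bytes of the global byte stream, the LRH fields can be packed by hand. The sketch below assumes the LRH bit layout defined elsewhere in the specification (VL 4 bits, LVer 4, SL 4, reserved 2, LNH 2, DLID 16, reserved 5, PktLen 11, SLID 16); the function name `pack_lrh` is illustrative only.

```python
def pack_lrh(vl, lver, sl, lnh, dlid, pkt_len, slid):
    """Pack LRH fields into 8 bytes; reserved bits are left as zero."""
    return bytes([
        (vl << 4) | lver,
        (sl << 4) | lnh,          # two reserved bits sit between SL and LNH
        dlid >> 8, dlid & 0xFF,
        (pkt_len >> 8) & 0x07,    # five reserved bits, then PktLen bits 10..8
        pkt_len & 0xFF,
        slid >> 8, slid & 0xFF,
    ])


lrh = pack_lrh(vl=0x7, lver=0x0, sl=0x1, lnh=0x3,
               dlid=0x375C, pkt_len=0x18, slid=0x17D2)
print(lrh.hex())  # 7013375c001817d2
```

The same function with LNH = 0x2 and PktLen = 0x0E yields the LRH of the local example (70 12 37 5C 00 0E 17 D2).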



The combined byte stream for the Global Packet (before ICRC and
VCRC) is shown in Table 29.

Table 29 Global Packet Byte Stream (before ICRC and VCRC)

Byte  Value    Byte  Value    Byte  Value    Byte  Value
0     0x70     25    0x00     50    0x24     75    0x0E
1     0x13     26    0x00     51    0x87     76    0xBB
2     0x37     27    0x00     52    0x00     77    0x88
3     0x5C     28    0x00     53    0x87     78    0x4D
4     0x00     29    0x00     54    0xB1     79    0x85
5     0x18     30    0x00     55    0xB3     80    0xFD
6     0x17     31    0x26     56    0x00     81    0x5C
7     0xD2     32    0x00     57    0x0D     82    0xFB
8     0x60     33    0x00     58    0xEC     83    0xA4
9     0x00     34    0x00     59    0x2A     84    0x72
10    0x00     35    0x00     60    0x01     85    0x8B
11    0x00     36    0x00     61    0x71     86    0xC0
12    0x00     37    0x00     62    0x0A     87    0x69
13    0x32     38    0x01     63    0x1C     88    0x0E
14    0x00     39    0x17     64    0x01     89    0xD4
15    0x10     40    0x00     65    0x5D     90    0x00
16    0x00     41    0x00     66    0x40     91    0x00
17    0x00     42    0x00     67    0x02
18    0x00     43    0x00     68    0x38
19    0x00     44    0x00     69    0xF2
20    0x00     45    0x00     70    0x7A
21    0x00     46    0x00     71    0x05
22    0x01     47    0x96     72    0x00
23    0x25     48    0x0A     73    0x00
24    0x00     49    0x20     74    0x00



Table 30 shows the masked byte stream used for ICRC calculation.

Table 30 Masked Byte Stream for ICRC Calculation

Byte  Value    Byte  Value    Byte  Value    Byte  Value
0     0xFF     25    0x00     50    0x24     75    0x0E
1     0xFF     26    0x00     51    0x87     76    0xBB
2     0xFF     27    0x00     52    0xFF     77    0x88
3     0xFF     28    0x00     53    0x87     78    0x4D
4     0xFF     29    0x00     54    0xB1     79    0x85
5     0xFF     30    0x00     55    0xB3     80    0xFD
6     0xFF     31    0x26     56    0x00     81    0x5C
7     0xFF     32    0x00     57    0x0D     82    0xFB
8     0x6F     33    0x00     58    0xEC     83    0xA4
9     0xFF     34    0x00     59    0x2A     84    0x72
10    0xFF     35    0x00     60    0x01     85    0x8B
11    0xFF     36    0x00     61    0x71     86    0xC0
12    0x00     37    0x00     62    0x0A     87    0x69
13    0x32     38    0x01     63    0x1C     88    0x0E
14    0x00     39    0x17     64    0x01     89    0xD4
15    0xFF     40    0x00     65    0x5D     90    0x00
16    0x00     41    0x00     66    0x40     91    0x00
17    0x00     42    0x00     67    0x02
18    0x00     43    0x00     68    0x38
19    0x00     44    0x00     69    0xF2
20    0x00     45    0x00     70    0x7A
21    0x00     46    0x00     71    0x05
22    0x01     47    0x96     72    0x00
23    0x25     48    0x0A     73    0x00
24    0x00     49    0x20     74    0x00

ICRC Result is: 0x1493EC46
VCRC Result is: 0x7C44
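
The differences between Table 29 and Table 30 suggest the masking applied to a global packet before the ICRC is computed: the eight LRH bytes become all ones, the GRH TClass, FlowLabel and HopLmt fields are set to ones, and the reserved BTH byte (now at offset 52 because of the 40-byte GRH) is set to 0xFF. The sketch below is inferred from those two tables only; the function name is hypothetical and the CRC computation itself is not shown.

```python
def mask_global_for_icrc(packet: bytes) -> bytes:
    """Apply the byte-level masks seen when comparing Table 29 and Table 30."""
    masked = bytearray(packet)
    masked[0:8] = b"\xff" * 8    # LRH bytes replaced by ones
    masked[8] |= 0x0F            # GRH: upper TClass bits (IPVer nibble kept)
    masked[9:12] = b"\xff" * 3   # GRH: rest of TClass and the FlowLabel
    masked[15] = 0xFF            # GRH: HopLmt
    masked[52] = 0xFF            # BTH: reserved byte
    return bytes(masked)


# First 16 bytes from Table 29, padded with zeros to the 92-byte length.
packet = bytes.fromhex("7013375c001817d26000000000320010") + bytes(76)
print(mask_global_for_icrc(packet)[:16].hex())
# ffffffffffffffff6fffffff003200ff
```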

Table 31 shows the complete Global Packet Byte Stream.

Table 31 Global Packet Byte Stream

Byte  Value    Byte  Value    Byte  Value    Byte  Value
0     0x70     25    0x00     50    0x24     75    0x0E
1     0x13     26    0x00     51    0x87     76    0xBB
2     0x37     27    0x00     52    0x00     77    0x88
3     0x5C     28    0x00     53    0x87     78    0x4D
4     0x00     29    0x00     54    0xB1     79    0x85
5     0x18     30    0x00     55    0xB3     80    0xFD
6     0x17     31    0x26     56    0x00     81    0x5C
7     0xD2     32    0x00     57    0x0D     82    0xFB
8     0x60     33    0x00     58    0xEC     83    0xA4
9     0x00     34    0x00     59    0x2A     84    0x72
10    0x00     35    0x00     60    0x01     85    0x8B
11    0x00     36    0x00     61    0x71     86    0xC0
12    0x00     37    0x00     62    0x0A     87    0x69
13    0x32     38    0x01     63    0x1C     88    0x0E
14    0x00     39    0x17     64    0x01     89    0xD4
15    0x10     40    0x00     65    0x5D     90    0x00
16    0x00     41    0x00     66    0x40     91    0x00
17    0x00     42    0x00     67    0x02     92    0x14
18    0x00     43    0x00     68    0x38     93    0x93
19    0x00     44    0x00     69    0xF2     94    0xEC
20    0x00     45    0x00     70    0x7A     95    0x46
21    0x00     46    0x00     71    0x05     96    0x7C
22    0x01     47    0x96     72    0x00     97    0x44
23    0x25     48    0x0A     73    0x00
24    0x00     49    0x20     74    0x00



7.8.4.3.3 Link Packet Example 



The field values for the Link Packet example are shown in Table 32. 

Table 32 Link Packet

Field    Value
Op       0x0
FCTBS    0x10D
VL       0x5
FCCL     0x21B

Generated FCCRC: 0xF9C9

Table 33 Link Packet Byte Stream

Byte  Value
0     0x01
1     0x0D
2     0x52
3     0x1B
4     0xF9
5     0xC9

7.9 Flow Control
7.9.1 Introduction



This section describes the link level flow control mechanism utilized by 
IBA to prevent the loss of packets due to buffer overflow by the receiver 
at each end of a link. This mechanism does not describe end to end flow 
control such as might be utilized to prevent transmission of messages 
during periods when receive buffers are not posted. 

Throughout this section, the terms "transmitter" and "receiver" are utilized 
to describe each end of a given link. The transmitter is the node sourcing 
data packets. The receiver is the consumer of the data packets. Each end 
of the link has a transmitter and a receiver. 

IBA utilizes an "absolute" credit based flow control scheme. Unlike many
traditional flow control schemes which provide incremental updates that
are added to the transmitter's available buffer pool, IBA receivers provide
a "credit limit". A credit limit is an indication of the total amount of data that
the transmitter has been authorized to send since link initialization.

Errors in transmission, in data packets, or in the exchange of flow control
information can result in inconsistencies in the flow control state perceived
by the transmitter and receiver. The IBA flow control mechanism provides
for recovery from this condition. The transmitter periodically sends an
indication of the total amount of data that it has sent since link initialization.
The receiver uses this data to re-synchronize the state between the
receiver and transmitter.



7.9.2 Flow Control Blocks 



The term "flow control block", or simply "block" indicates a quantity of data 
in a data packet. This quantity is defined to be the size of the data packet 
in bytes (every byte between the local route header and the variant CRC, 
inclusive) divided by 64 bytes, and rounded up to the next integral value. 
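
The block count is a ceiling division by 64. A minimal Python sketch, with an illustrative function name, applied to the two example packets from Section 7.8.4.3 (58 bytes for the local packet of Table 26, 98 bytes for the global packet of Table 31):

```python
def flow_control_blocks(packet_bytes: int) -> int:
    """Blocks for a packet measured from the LRH through the VCRC, inclusive."""
    return (packet_bytes + 63) // 64


print(flow_control_blocks(58))  # 1
print(flow_control_blocks(98))  # 2
```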



7.9.3 Relationship to Virtual Lanes 



The flow control algorithm defined in this chapter is applied to each virtual 
lane independently, except for virtual lane 15 which is not subject to link 
level flow control. 



7.9.4 Flow Control Packet 

Figure 60 Flow Control Packet Format

C7-53: Flow control packets shall be sent for each VL except VL15 upon
entering the LinkInitialize state. When in the PortStates LinkInitialize,
LinkArm or LinkActive, a flow control packet for a given virtual lane shall
be transmitted prior to the passing of 65,536 symbol times since the last
time a flow control packet for the given virtual lane was transmitted.

C7-54: Flow control packets shall use the format specified in Figure 60
Flow Control Packet Format on page 176.

A symbol time is defined as the time required to transmit an eight bit data
quantity onto the link. Flow control packets may be transmitted as often as
necessary to return credits and enable efficient utilization of the link. See
Section 7.6.4, "Buffering and Flow Control For Data VLs," on page 149 for
additional information.
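
The field widths given in Section 7.9.4.1 (Op 4 bits, FCTBS 12 bits, VL 4 bits, FCCL 12 bits, followed by a 16-bit LPCRC) can be checked against the Link Packet example of Tables 32 and 33. The sketch below assumes the fields are packed in that order, most significant bit first, which is what the example byte stream implies; the function name is illustrative, and the CRC is not computed.

```python
def pack_flow_control(op: int, fctbs: int, vl: int, fccl: int) -> bytes:
    """Pack Op/FCTBS/VL/FCCL into the first four bytes of a flow control packet."""
    word0 = ((op & 0xF) << 12) | (fctbs & 0xFFF)
    word1 = ((vl & 0xF) << 12) | (fccl & 0xFFF)
    return word0.to_bytes(2, "big") + word1.to_bytes(2, "big")


body = pack_flow_control(op=0x0, fctbs=0x10D, vl=0x5, fccl=0x21B)
print(body.hex())  # 010d521b
```

These four bytes match bytes 0 through 3 of Table 33; bytes 4 and 5 carry the CRC over them.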

7.9.4.1 Flow Control Packet Fields
7.9.4.1.1 Operand (Op) - 4 Bits

The flow control packet is a link packet with one of two Op (operand)
values: An operand of 0x0 indicates a normal flow control packet. An
operand value of 0x1 indicates a flow control init packet.

C7-55: When in the PortState LinkInitialize, flow control packets shall be
sent with the flow control init operand, 0x1.

C7-56: When in the PortStates LinkArm or LinkActive, flow control
packets shall be sent with the normal flow control operand, 0x0.

C7-57: All other values of the Op field are reserved for operations that
may be defined by IBA in the future. Any packet received with a reserved
value shall be discarded.

7.9.4.1.2 Flow Control Total Blocks Sent (FCTBS) - 12 Bits

The FCTBS (Flow Control Total Blocks Sent) field is generated by the
transmitter side logic. The calculation for the value of FCTBS is described
later.

7.9.4.1.3 Flow Control Credit Limit (FCCL) - 12 Bits

The FCCL (Flow Control Credit Limit) field is generated by the receiver
side logic. The calculation for the value of FCCL is described later.

7.9.4.1.4 Virtual Lane (VL) - 4 Bits

VL (Virtual Lane) is set to the virtual lane to which the FCTBS and FCCL
fields apply.

7.9.4.1.5 Link Packet Cyclic Redundancy Check (LPCRC) - 16 Bits

LPCRC (Link Packet Cyclic Redundancy Check) field is a 16-bit CRC that
covers the first four bytes of the flow control packet. See Section 7.8.3,
"Link Packet CRC (LPCRC) - 2 Bytes," on page 164.

7.9.4.2 Calculation of FCTBS

C7-58: Upon transmission of a flow control packet, FCTBS shall be set to
the total blocks transmitted in the virtual lane since link initialization.

C7-59: When in the PortState initialize, FCTBS shall be set to zero.

7.9.4.3 Calculation of FCCL



The FCCL calculation is based on a 12-bit Adjusted Blocks Received
(ABR) counter maintained for each virtual lane at the receiver.

C7-60: The ABR counter shall be set to zero when in the PortState
initialize.

C7-61: Upon receipt of each flow control packet, the ABR shall be set to
the value of the FCTBS field.

C7-62: Upon receipt of each data packet, the ABR shall be increased by
the blocks received, modulo 4096, except that the ABR shall not be
increased for received packets that are discarded due to lack of receive
capacity in the receiver.

Upon transmission of a flow control packet, FCCL shall be set to one of
the following:

• If the current buffer state of the receiver would permit reception of
  2048 or more blocks from all combinations of valid IBA packets
  without discard, then the FCCL shall be set to ABR plus 2048 modulo
  4096.

• Otherwise the FCCL shall be set to ABR plus the number of blocks
  the receiver is capable of receiving from all combinations of valid IBA
  packets without discard modulo 4096.

C7-63: The FCCL field shall be set as specified in Section 7.9.4.3,
"Calculation of FCCL," on page 177.

The number of blocks the receiver is capable of receiving means the
number that the receiver can guarantee to receive without buffer overflow
regardless of the sizes of the packets that arrive. If, for example, a
receiver is capable of receiving more data when large packets arrive than
for small packets, the receiver must use the smaller capacity when
calculating FCCL.

This specification does not preclude the reconfiguration of receive buffers
while the link is active. Such reconfiguration may result in changes of the
FCCL value, including the possibility of reduction of available credit. Also,
link errors may cause discrepancies between ABR at the receiver and
FCTBS at the transmitter. When this has happened, the next flow control
update to the receiver will correct the value of ABR and may result in
changes of FCCL which reduce or increase credit. When FCCL is
updated, the credit calculation for outgoing data packets should use the new
value. Packets that are currently being transmitted or queued may be sent
based on the previous FCCL value.
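
The receiver-side bookkeeping in C7-60 through C7-62 and the two FCCL cases can be sketched as a small per-VL state object. This is an illustration under stated assumptions, not a reference implementation: the class and method names are hypothetical, and `free_blocks` stands for the number of blocks the receiver can guarantee to accept without discard.

```python
MOD = 4096  # ABR, FCTBS and FCCL are 12-bit quantities


class ReceiverCredit:
    """Per-VL receiver state following C7-60 through C7-62."""

    def __init__(self):
        self.abr = 0                      # C7-60: zero at initialization

    def on_flow_control(self, fctbs: int) -> None:
        self.abr = fctbs % MOD            # C7-61: adopt the sender's count

    def on_data_packet(self, blocks: int, discarded: bool = False) -> None:
        if not discarded:                 # C7-62: discarded packets do not count
            self.abr = (self.abr + blocks) % MOD

    def fccl(self, free_blocks: int) -> int:
        """Credit limit to advertise; the grant is capped at 2048 blocks."""
        return (self.abr + min(free_blocks, 2048)) % MOD


rx = ReceiverCredit()
rx.on_data_packet(3)
print(rx.fccl(free_blocks=10))    # 13
print(rx.fccl(free_blocks=5000))  # 2051
```

Capping the grant at 2048 keeps the advertised limit within half of the 12-bit counter range, which is what allows the transmitter's modulo comparison to be unambiguous.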

30 

7.9.4.4 Transmission of Packets

If a data packet is available for transmission:

• Let CR represent the total blocks sent since link initialization plus the
  number of blocks in the data packet to be transmitted, all modulo
  4096.

• Let CL represent the last FCCL received in a flow control packet.

If (CL-CR) modulo 4096 < 2048, then the data packet may be transmitted.
If the condition is not true, then the data packet may not be transmitted
until the condition becomes true. Flow control packet transmission is not
subject to this restriction nor are any packets on virtual lane 15.

C7-64: A non-VL15 data packet may only be sent when there is sufficient
credit as determined by the calculation in Section 7.9.4.4, "Transmission
of Packets," on page 178.

C7-65: VL15 packets shall not be subject to flow control.
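
The transmit-side test in Section 7.9.4.4 reduces to one modular comparison. The sketch below follows the condition exactly as printed above (strictly less than 2048); the function and parameter names are illustrative.

```python
def may_transmit(total_blocks_sent: int, packet_blocks: int, last_fccl: int) -> bool:
    """Credit check for a non-VL15 data packet."""
    cr = (total_blocks_sent + packet_blocks) % 4096
    return (last_fccl - cr) % 4096 < 2048


print(may_transmit(0, 1, 2048))    # True
print(may_transmit(10, 2, 11))     # False
print(may_transmit(4090, 10, 8))   # True
```

The third call shows the wrap-around case: CR is 4100 modulo 4096, that is 4, which is still within the advertised limit of 8. Python's `%` always returns a non-negative result for a positive modulus, so no extra handling is needed when CL is numerically smaller than CR.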



7.10 IBA and Raw Packet Multicast
7.10.1 Overview



Multicast is a one-to-many communication paradigm designed to improve
the efficiency of communication between a set of end nodes. Figure 61
illustrates an example unreliable multicast IBA operation:

• A packet with PSN = 1129 is received on an IBA routing element
  (switch or router) port.

• Switches extract the multicast DLID from the LRH to determine if
  it corresponds to a multicast group. An implementation may maintain
  this data as part of its internal route table, e.g. a bit-mask
  which corresponds to the output ports to which this packet should be
  forwarded.

• Routers extract the GID from the GRH for IBA multicast or, for
  raw packet support, examine the IPv6 header or Ethertype within
  the RWH to determine if the packet corresponds to a multicast
  group. A router uses this information to forward the packet to the
  next hop(s) to the destination(s).

• Switches or routers replicate the packet (implementation dependent)
  and forward the packet onto the output port(s).
[Figure 61 labels recovered from the drawing: End Node, QP2, QP3, QP4,
HCA or TCA, Port, PKT#1129, IBA Switch, "Switch decodes inbound ..."]

Figure 61 Example IBA Unreliable Multicast Operation



7.10.2 IBA Unreliable Multicast Operational Rules 



o7-13: IBA unreliable multicast is an optional capability. When
implemented, it shall function based on the operational rules in
Section 7.10.2, "IBA Unreliable Multicast Operational Rules," on page 180.

1) Multicast capability discovery, route table modification, status, and
control shall be administered by an IBA management entity. Refer to
15.2.5.7 MulticastForwardingRecord on page 683 and 14.2.5.12
MulticastForwardingTable on page 647.

2) Within the network, packets are replicated within IBA switches and
routers and forwarded to the corresponding output ports.

3) Packets are not reliable with respect to acknowledgment generation
nor delivery guarantees.

4) Switches and routers may vary in their ability to support multicast
packets and thus may have implementation-specific scheduling,
resource management, and congestion policies which are outside the
scope of IBA.

5) Application multicast packets may be transmitted on VLs as assigned
via the SL to VL mapping table by the subnet manager. The use of VL
15 for multicast is prohibited.

6) Application multicast packet headers may contain any SL as provided
or derived from values provided by the subnet manager.

7) Applications targeting a multicast group use a multicast group GID -
each CA or router port participating in a multicast group shall be
assigned the corresponding multicast group GID.

8) Each CA, switch or router that supports multicast may participate in
zero, one, or many multicast groups.

9) Multicast groups may span multiple subnets - a multicast capable
router is required to forward packets to the next hop to the
destination.

10) Multicast packets may be generated by either a CA or a router.

11) Multicast group membership is opaque to the participating end
nodes, i.e. it is impossible to know which end nodes are participating
within a multicast group and whether all participating end nodes
within a multicast group reside within a local or remote subnet.
Therefore, all IBA multicast packets shall contain a GRH with the
destination multicast GID defined per the IBA addressing rules.

12) The SGID within the GRH shall be set to the source port which
initially injected the packet into the network.

13) Messages shall be limited to single-packet messages. The maximum
message size is set during the multicast group's creation. The group
creator sets the Path MTU (PMTU) for the multicast group. A CA /
router will query the SM for the PMTU during the multicast group join
operation.

If an end node attempts to join a multicast group and is unable to
accept the current PMTU, the join operation must fail.

14) For each multicast group a CA port is participating in, the CA port
shall associate at least one locally managed QP.

• If a source port is also a destination port within the destination
  multicast group, the source shall internally replicate the packet
  within the channel interface to the associated local QPs.

• If the destination end node contains multiple locally managed
  QPs participating in a multicast group, the destination end node is
  responsible for internally replicating the packet within the channel
  interface and delivering a copy to each QP.

[Figure 62 callouts recovered from the drawing, both cut off in the
source where the ellipsis appears: "Destination end node delivers one
internally replicated copy of the packet to each locally managed QP
participating in the ..." and "If the source end node contains QPs which
are targets of send operations, the end node shall internally replicate
the packet and deliver it to each participating QP. Replication occurs ..."]

Figure 62 Packet Delivery within an end node

15) Unreliable multicast shall use the unreliable datagram transport
service. Refer to the unreliable datagram transport services section
for operational rules, constraints, verification, and error handling.

16) A source end node shall set the destination QP within the packet
header to 0xFFFFFF.
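
Several of the rules above are simple per-packet checks that a sender could apply before injecting a multicast packet. The following Python sketch covers rules 5, 11, 13 and 16 only; the function name, its parameters and the message strings are hypothetical and chosen for illustration.

```python
MULTICAST_DEST_QP = 0xFFFFFF  # rule 16


def check_multicast_send(vl, has_grh, dest_qp, payload_len, pmtu):
    """Return the list of violated rules from Section 7.10.2 (empty if none)."""
    problems = []
    if vl == 15:
        problems.append("rule 5: VL 15 is prohibited for multicast")
    if not has_grh:
        problems.append("rule 11: a GRH with the multicast DGID is required")
    if payload_len > pmtu:
        problems.append("rule 13: message must fit in a single packet")
    if dest_qp != MULTICAST_DEST_QP:
        problems.append("rule 16: destination QP must be 0xFFFFFF")
    return problems


print(check_multicast_send(vl=0, has_grh=True, dest_qp=0xFFFFFF,
                           payload_len=256, pmtu=1024))  # []
```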
7.10.3 Raw Packet Multicast

Raw packets may be multicast using the same basic principles as
unreliable multicast IBA packets.

Figure 63 Example Raw Packet Multicast Operation



7.10.3.1 Raw Multicast Operational Rules 



o7-14: Raw packet unreliable multicast is an optional capability. When
implemented, it shall function based on the operational rules in
Section 7.10.3, "Raw Packet Multicast," on page 183.

1) Raw packet multicast is optional functionality defined within IBA.

2) Raw multicast capability discovery, route table modification, status,
and control shall be administered by an IBA management entity.

3) Within the network, packets are replicated within IBA switches and
forwarded to the corresponding output ports.

• Switches extract the multicast DLID from the LRH to determine
  the corresponding output ports.

4) Routing elements may vary in their ability to support multicast
packets and thus may have implementation-specific scheduling,
resource management, and congestion / drop policies which are
outside the scope of this architecture.

5) Raw multicast packets may be transmitted on any VL except VL 15.

6) Raw multicast packets may be transmitted using any valid SL.

7) IPv6 applications target a multicast group using an IPv6 multicast
address. All other protocols use protocol specific addressing and
resolution.

8) Each CA or router which supports multicast may participate in zero,
one, or many multicast groups.

9) Raw multicast groups may span multiple subnets - a multicast
capable router is required to forward packets to the next hop to the
destination.

10) Raw multicast packets may be generated by either a CA or a router.

11) Messages shall be limited to single-packet messages. The maximum
message size is a function of the PMTU between the source and
destination end nodes. Raw protocol management will interact with the IBA
management entity to determine the maximum PMTU allowed. Raw
multicast operations are not required to fail if the PMTU is too small -
error recovery is the responsibility of the raw multicast group
management protocol.

12) Raw packet support requires a minimum of one locally managed QP.
An implementation may provide additional QPs based on
implementation-specific policies. As such, implementations are responsible for
local raw packet replication and delivery.

• If a source port is also a destination port within the destination
  multicast group, the source shall internally replicate the packet
  within the channel interface to the associated local application
  targets.

• If the destination end node contains multiple participating
  application targets within a raw multicast group, the destination end node
  is responsible for internally replicating the packet within the
  channel interface and delivering a copy to each target.

13) Raw packet multicast shall use the IBA raw packet header formats
and semantics.

7.10.4 Group Management

IBA Release 1.0 does not fully define the multicast group management
protocol used to implement join and leave operations. However, the
management section does contain the management interface and associated
MADs to implement a multicast group protocol. Refer to 15.2.5.17
MCGroupRecord on page 692.

7.11 Subnet Multipathing

7.11.1 Multipathing Requirements on End Node

Each CA port is initialized with a LID plus an LMC (LID Mask Control) by
the subnet manager. The value of LMC indicates the number of low order
bits of the LID to mask when checking a received DLID against the port's
DLID. LMC may take values from 0 to 7. Therefore, a port may be
identified by 1 to 128 unicast LIDs.

C7-66: When a link layer of a CA port checks that a unicast DLID in a
received packet is a valid DLID for that port, it shall mask the number of low
order bits indicated by the LMC before comparing the DLID to its assigned
LID.

The subnet manager may program alternate paths through the subnet for
these various LIDs. The selection of which LID to use in the SLID and
DLID of transmitted packets is covered in the Transport chapter.
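
The masked comparison required by C7-66 can be written as a shift on both sides. A minimal Python sketch, with illustrative function names and an arbitrary example LID of 0x0040:

```python
def dlid_matches(dlid: int, base_lid: int, lmc: int) -> bool:
    """Compare a received unicast DLID to the port's LID, ignoring the low LMC bits."""
    return (dlid >> lmc) == (base_lid >> lmc)


def lid_count(lmc: int) -> int:
    """Number of unicast LIDs that address the port (LMC 0..7 gives 1..128)."""
    return 1 << lmc


print(dlid_matches(0x0047, 0x0040, lmc=3))  # True
print(dlid_matches(0x0048, 0x0040, lmc=3))  # False
print(lid_count(7))                         # 128
```

With LMC = 3 the port answers to the eight LIDs 0x0040 through 0x0047, each of which the subnet manager may route along a different path.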



7.12 Error detection and handling
7.12.1 Error Detection

The following classes of errors are detectable by the link layer:

• Single packet receive errors

  • Local physical errors - errors indicative of bit errors on the
    attached physical link. Failures of ICRC, LPCRC and VCRC checks
    in the packet check state machines and entry to the bad packet
    state of the packet receiver state machine belong to this class.

  • Remote physical errors - errors indicative of bit errors on a link
    other than the attached physical link. Entry to the marked bad
    packet state of the packet receiver state machine belongs to this
    class.

  • Malformed packet errors - errors indicative of packets transmitted
    with inconsistent content. The packet was possibly bad at the
    source. It is also possible that the error was inserted by a switch.
    Programming errors of switch or port configuration by the SM
    may also create errors in this category. LVer error, Length error,
    op_code error, VL error, and GRH_VL15 error belong to this
    class. These are all errors from the packet check state machines.

  • Switch routing errors - errors indicative of an error in switch
    routing. DLID errors are in this class.

  • Buffer overrun - error indicative of an error in the state of the flow
    control machine in the link layer at the other end of the physical
    link. One cause of such an error can be an earlier packet with a
    physical error if buffers are not immediately reclaimed from bad
    packets.

C7-67: An error in a received packet shall be classified as specified in
Section 7.12.1, "Error Detection," on page 185.

C7-68: When error counters for the single packet receive errors are
implemented and one or more errors are detected in a received packet, then
the counter associated with the error with the highest precedence as
defined by Section 7.3, "Packet Receiver States," on page 138, Section 7.4,
"Data Packet Check," on page 141, and Section 7.5, "Link Packet Check,"
on page 144 shall increment and none of the other single packet error
counters shall increment.

14 

• Receiver errors

  • Local link integrity - excessively frequent local physical errors.
    This error is caused by a marginal link. A more severe physical
    problem will be detected at the physical layer based on high
    frequency of code violations. Detection of local link integrity errors is
    based on a count of local physical errors. The count starts at zero
    and shall be incremented for each packet received with a local
    physical error. If the current count is above zero, the counter shall
    be decremented once for each packet received without a local
    physical error. When it exceeds local_phy_errors threshold, the
    local link integrity error shall be detected.

  • Excessive buffer overruns - buffer overrun errors persisting over
    multiple flow control update times. This error shall be detected
    when overrun_errors_threshold consecutive flow control update
    periods occur with at least one overrun error in each period.

C7-69: Each port shall implement detection of local link integrity and
excessive buffer overrun errors as specified in Section 7.12.1, "Error
Detection," on page 185.

• Transmitter errors

  • Flow control update - errors indicative of a failure of the flow
    control machine at the other end of the link. For each VL active in the
    current port configuration, except VL 15, there shall be a
    watchdog timer monitoring the arrival of flow control updates. If the
    timer expires without receiving an update, a flow control update error
    has occurred. The period of the watchdog timer shall be 400,000
    +3%/-51% symbol times. This timer shall only run when PortState
    = Arm or Active. When PortState = ActiveD, this timer shall be
    reset. When PortState = Initialize or when a flow control packet is
    received, the timer shall be reset.

C7-70: Each port shall implement detection of flow control update errors
as specified in Section 7.12.1, "Error Detection," on page 185.
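
The local link integrity count described above behaves like a leaky counter: it rises on each packet with a local physical error and falls, never below zero, on each clean packet. The Python sketch below illustrates that rule only. The class and method names are hypothetical, and what happens to the count after detection (for example, during retraining) is not modeled because this section does not specify it.

```python
class LinkIntegrityMonitor:
    """Count of local physical errors, per the local link integrity rule."""

    def __init__(self, local_phy_errors_threshold: int):
        self.threshold = local_phy_errors_threshold
        self.count = 0

    def on_packet(self, local_physical_error: bool) -> bool:
        """Update the count; return True when the error condition is detected."""
        if local_physical_error:
            self.count += 1
        elif self.count > 0:
            self.count -= 1
        return self.count > self.threshold


mon = LinkIntegrityMonitor(local_phy_errors_threshold=2)
history = [True, True, False, True, True, True]
print([mon.on_packet(bad) for bad in history])
# [False, False, False, False, True, True]
```

Isolated errors are forgiven by later clean packets, so only a sustained error rate pushes the count past the threshold.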

7.12.2 Error Recovery Procedures

The response to any single packet receive error is to discard the packet.
No further recovery is necessary at the link layer. For some errors, the
data packet check state machine (Section 7.4, "Data Packet Check," on
page 141) allows a switch to forward a packet with an error, marking it as
bad by appending a bad VCRC value and the EBP delimiter, as an
alternative to dropping the packet.

Local link integrity, excessive buffer overrun, and flow control update
errors all indicate errors that may be fixed by retraining or may be due to a
hard fault.

C7-71: Upon detecting local link integrity, excessive buffer overrun or flow
control update errors, the link shall initiate retraining by asserting
L_init_train (refer to 6.3.1.2 L Init Train - Link Initiate Retraining on page
131).

7.12.3 Error Notification

Single packet receive error classes increment error counters as specified
in management (Refer to 16.1.4 Optional Attributes on page 735). Note
that at most one link layer error is detected per packet so each packet
increments one and only one of these counters.

Local link integrity, excessive buffer overrun, and flow control update are
counted and may produce a trap as specified in management.

Chapter 8: Network Layer

8.1 Overview

This chapter describes the network layer within IBA. Within the IBA
layered architecture, this layer is responsible for routing packets between
IBA subnets. This includes unicast and multicast operations. This chapter
specifies routing between IBA subnets - it does not specify multi-protocol
routing, i.e. routing IBA over non-IBA fabric types, nor does it specify how
raw packets are routed between IBA subnets.

This chapter, with the exception of section 8.4 Global Route Header
Usage on page 192, is informational in nature. As such, it does not specify
IBA requirements; refer to Chapter 19: Routers on page 830 for
requirements of IBA routers. Packet forwarding within an IBA subnet is done at
the link layer by IBA switches; refer to Chapter 18: Switches on page 813
for requirements of IBA switches.

8.2 Packet Routing
8.2.1 Overview



IBA supports a two-layer topological division. The lower layer is referred
to as an IBA subnet. Packets are forwarded throughout the subnet
utilizing IBA switches (the process of forwarding a packet from one link to
another within a subnet is referred to as switching). The path that a packet
takes through this layer is uniquely defined by its point of injection into the
fabric, identified in the packet by the SLID field in the LRH, and the DLID
and SL fields in its LRH.

At the higher layer, subnets are interconnected using routers (the process
of forwarding a packet from one subnet to another is referred to as
routing). Routing may be accomplished utilizing routers conforming to the
IBA specification, and may also be accomplished using routers
conforming to other specifications (e.g. utilizing the Internet Protocol (IP) suite
of specifications). The series of subnets through which a packet passes
is not defined by IBA; however, several fields are provided in the Global
Route Header to enable routers to make this decision. These fields
include SGID, DGID, TClass and FlowLabel. Additionally, a router might
use fields from other headers, e.g. the SL field in the LRH to determine a
mapping to TClass. Regardless of the mechanism used in forwarding
decisions, IBA requires that the path be symmetric with respect to SGID
and DGID. This means that if a valid path exists from an SGID to a DGID,
then IBA requires that a valid path also exist swapping the values of DGID
and SGID.

The requirements of IBA routers are specified in Chapter 19: Routers on
page 830. Interconnection of IBA subnets utilizing IBA routers is intended
to preserve IBA intra-subnet behavior across subnets.

Use of other routing technologies is beyond the scope of IBA; however,
the architecture is intentionally crafted to enable this capability, especially
utilizing IP version 6 as specified by IETF RFC 2460 and other associated
IETF RFCs.

A global IBA fabric consists of one IBA subnet or multiple IBA subnets
interconnected via routers. As described above, this global fabric may also
include non-IBA interconnects between IBA subnets, as well as gateways
to non-IBA fabrics.

8.2.2 Global Fabric Characteristics

This section describes the characteristics of a global fabric interconnected
exclusively with IBA routers. While beyond the scope of IBA, global
fabrics interconnected with non-IBA technology may also exhibit some or all
of these characteristics.

8.2.2.1 Inheritance of Subnet Requirements

All the packet delivery characteristics of a subnet are inherited by the
global fabric, except for virtual lane 15 subnet management packets
(since subnet management occurs at the subnet level, these packets do
not transit routers).

8.2.2.2 Packet Errors and Error Detection

IBA specifies an invariant CRC that is appended to all IBA packets except
raw packets (refer to section 7.8.1 Invariant CRC (ICRC) - 4 Bytes on
page 161). This CRC covers all of the IBA packet fields that do not require
modification to effect IBA routing. End-to-end data integrity assurance is
provided by retaining this CRC unmodified as the packet transits the
global fabric.

8.2.2.3 Service Levels

Service levels and virtual lanes are supported throughout the global
fabric. This is accomplished by mapping service level to traffic class in the
GRH, and vice versa. The mapping function itself, as is the interpretation
of service level, is beyond the scope of IBA.

8.2.3 Support for Multiple Global Paths

The information required to route a packet within a subnet and between
subnets is contained in the packet's local route header and global route
headers, respectively. Unlike many network protocols, IBA does not
require a packet to contain a global route header unless the packet is either
destined for a device that is not on the same subnet or the packet is a
multicast packet. However, any packet except subnet management packets
may contain a global route header (subnet management packets are
defined in 14.2.1 Datagram Formats and Use on page 611).

The identification and utilization of multiple paths between two channel
adapters on different subnets is hierarchical and involves similar but
independent mechanisms within subnets and across subnets.

Within subnets, multiple paths between two channel adapters are
identified by multiple LIDs. That is, a port may effectively be assigned multiple
LIDs using a LID/LMC combination (see Chapter 4: Addressing on page 108).
The source channel adapter indicates a path via its selection of one of the
LIDs assigned to the destination port.
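
To make the LID/LMC idea concrete, here is a minimal Python sketch. It assumes, following the addressing rules referenced in Chapter 4 rather than anything stated in this section, that a port with a base LID and an LMC value answers to 2^LMC consecutive LIDs and that the low LMC bits of the base LID are zero. The function names `lids_for_port` and `select_dlid` are hypothetical.

```python
from typing import List


def lids_for_port(base_lid: int, lmc: int) -> List[int]:
    """Return the LIDs a port answers to, assuming 2**lmc consecutive LIDs."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC is assumed to be a 3-bit value (0..7)")
    if base_lid & ((1 << lmc) - 1):
        raise ValueError("base LID is assumed to have its low LMC bits clear")
    return [base_lid + i for i in range(1 << lmc)]


def select_dlid(base_lid: int, lmc: int, path_index: int) -> int:
    """Pick one destination LID; the choice selects a path within the subnet."""
    lids = lids_for_port(base_lid, lmc)
    return lids[path_index % len(lids)]
```

With a base LID of 0x10 and an LMC of 2, the source channel adapter could choose among four DLIDs, each of which the subnet may route along a different path.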

Likewise, channel adapters have the option to support the assignment of
multiple GIDs. In the case of global routing across subnets, the LID
indicates which of the valid paths is to be used within the subnet (i.e. switch
forwarding) and the GID indicates which of the valid paths is to be used
between subnets (i.e. router forwarding).

As a packet transits a subnet, its SLID and DLID fields remain unchanged.
As a packet transits between subnets (i.e. through a router), the router
updates the SLID to that of its own LID and the DLID to the LID of the next
router or final destination, as appropriate.

Note that for global routing, this provides two degrees of freedom for a
source channel adapter to select a path through the fabric. Selection of
the LID determines the route through the subnet to the first router.
Selection of the GID determines the route taken after reaching the first router.
Each router along the path may choose the path through a subnet to the
next router (or final destination) via its selection of the LID for the next
router (or final destination). Furthermore, since the DLID may contain
LMC bits of multipath data, the router may use the DLID as part of its route
determination algorithm.

The decision process that routers use for forwarding packets is not
specified by IBA; however, routers may rely on various combinations of
Destination GID, Source GID, SL, TClass, and FlowLabel fields, among other
factors, to determine the forwarding path and flows that must exhibit
in-order delivery. Channel Adapters and/or ingress routers may label flows
of packets that are expected to be delivered in order with the same
FlowLabel in the global route header. While IBA routers utilize LIDs and GIDs
to determine paths, the FlowLabel may be used by non-IBA routers to
determine paths.

8.2.4 Global Multicast

IBA supports an unreliable multicast mechanism. A detailed description of
this mechanism may be found in section 7.10 IBA and Raw Packet
Multicast on page 179. Implementation of this mechanism is optional in IBA
devices (including switches and routers). Multicast packets within a given
multicast group, i.e. multicast packets that share a common multicast
GID, may be sourced by a single device or by multiple devices. Since
routers are not fully specified by IBA, routers may vary in their ability to
support multicast packets and may have implementation specific.

8.3 Global Route Header

Figure 64 on page 191 illustrates the format of the Global Route Header
that is used for inter-subnet routing.

Figure 64 Global Route Header (GRH)

Global route headers are not required in all packets (see section 8.4.1
Global Route Header Generation on page 192 for details). The presence
of a Global Route Header is indicated in the Local Route Header as
specified in 7.7.5 Link Next Header (LNH) - 2 bits on page 160. The following
subparagraphs describe the fields in the GRH:

8.3.1 IP Version (IPVer) - 4 bits 

Indicates the version of the GRH; always set to 6. 

8.3.2 Traffic Class (TClass) - 8 bits 

This field is used to communicate service level end-to-end, i.e. across
subnets. The mapping of specific traffic class to specific TClass values is
not specified by IBA and may vary by implementation.

8.3.3 Flow Label (FlowLabel) - 20 bits

This field may be used to identify a sequence of packets that must be
delivered in order.

8.3.4 Payload Length (PayLen) - 16 bits

This field specifies the length of the packet, in bytes, starting from the first
byte after the global route header up to and including the last byte
preceding the VCRC.

8.3.5 Next Header (NxtHdr) - 8 bits

This field indicates what header, if any, follows the global route header.

8.3.6 Hop Limit (HopLmt) - 8 bits

This field indicates the number of hops (i.e. the number of routers
transited) that the packet is permitted to take prior to being discarded. This
ensures that a packet will not loop indefinitely between routers should a
routing loop occur. Setting this value to 0 or 1 will ensure that the packet
will not be forwarded beyond the local subnet.

8.3.7 Source Global Identifier (SGID) - 64 bits

This field identifies the port that injected the packet into the global fabric.
Additional information on the format and use of GIDs may be found in
Chapter 4: Addressing on page 108.

8.3.8 Destination Global Identifier (DGID) - 64 bits

This field identifies the final destination port of the packet, or the
multicast group that represents the set of ports to which the packet is to be
delivered. Additional information on the format and use of GIDs may be
found in Chapter 4: Addressing on page 108.
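
As an illustration of the field widths listed in 8.3.1 through 8.3.6, the following Python sketch packs and unpacks only the first eight bytes of a GRH (IPVer, TClass, FlowLabel, PayLen, NxtHdr, HopLmt). The bit ordering is an assumption: Figure 64 is not reproduced in this text, so the sketch follows the IPv6 header layout of RFC 2460, which uses the same field names and widths. The SGID and DGID are deliberately left out, and the function names are hypothetical.

```python
import struct


def pack_grh_prefix(tclass: int, flow_label: int, pay_len: int,
                    nxt_hdr: int, hop_lmt: int, ip_ver: int = 6) -> bytes:
    """Pack the first 8 GRH bytes, assuming an IPv6-style big-endian layout."""
    if not (0 <= ip_ver < 1 << 4 and 0 <= tclass < 1 << 8
            and 0 <= flow_label < 1 << 20):
        raise ValueError("IPVer, TClass or FlowLabel out of range")
    if not (0 <= pay_len < 1 << 16 and 0 <= nxt_hdr < 1 << 8
            and 0 <= hop_lmt < 1 << 8):
        raise ValueError("PayLen, NxtHdr or HopLmt out of range")
    # Word 0: IPVer (4 bits) | TClass (8 bits) | FlowLabel (20 bits)
    word0 = (ip_ver << 28) | (tclass << 20) | flow_label
    # Word 1: PayLen (16 bits) | NxtHdr (8 bits) | HopLmt (8 bits)
    return struct.pack(">IHBB", word0, pay_len, nxt_hdr, hop_lmt)


def unpack_grh_prefix(data: bytes) -> dict:
    """Inverse of pack_grh_prefix for the first 8 bytes of a GRH."""
    word0, pay_len, nxt_hdr, hop_lmt = struct.unpack(">IHBB", data[:8])
    return {
        "ip_ver": word0 >> 28,
        "tclass": (word0 >> 20) & 0xFF,
        "flow_label": word0 & 0xFFFFF,
        "pay_len": pay_len,
        "nxt_hdr": nxt_hdr,
        "hop_lmt": hop_lmt,
    }
```

The six fields add up to 64 bits, which is why they fit exactly into two 32-bit words ahead of the two GID fields.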

8.4 Global Route Header Usage

The following subsections describe the usage of the global route header:

8.4.1 Global Route Header Generation

C8-1: A channel adapter initiating a packet shall include a global route
header if any of the following conditions apply:

• The packet is a multicast packet.

• The final destination of the packet is a port of a device that is not on
the same subnet as the port that initially injects the packet into the
fabric and both the injecting and receiving ports are connected to IBA
subnets.

o8-1: A channel adapter, switch, or router initiating a packet may include
a global route header in any packet except for SMPs.

If a global route header is included, the fields are loaded by the initiating
channel adapter, switch, or router as follows:

C8-2: IPVer: If a global route header is included in a packet, this field shall
be set to 6.

C8-3: TClass: If a global route header is included in a packet, this field
shall either be set to zero or to an appropriate TClass value by the
injecting channel adapter. Each router maps TClass to a SL appropriate for
the subnet on which it will inject the packet. This mapping function is not
specified by IBA.

FlowLabel: The use of this field is not required by IBA.

C8-4: If a global route header is included in a packet, and FlowLabel is not
used, it shall be set to zero.

C8-5: If a global route header is included in a packet and FlowLabel is
used, all packets that must be delivered in order with respect to each other
shall be identified by a constant, non-zero value inserted in the FlowLabel
field.

This implies that if a given QP uses a non-zero flow label, it must use the
same flow label on all packets emitted from that QP that are destined for
a given remote QP. Different QPs transmitting to a given destination may
use the same or different flow labels. Flow labels may be shared among
QPs.

C8-6: PayLen: If a global route header is included in a packet, this field
shall be loaded with the length of the packet, in bytes, starting from the
first byte after the global route header up to and including the last byte of
payload (excluding pad bytes, if any).

NxtHdr: The use of this field varies depending on whether the packet is a
raw or non-raw packet.

C8-7: For non-raw IBA packets that include a GRH, the NxtHdr shall
contain (the value to be provided by the IETF - this will be fixed in errata once
assigned).

C8-8: For raw packets that include a GRH, the contents of NxtHdr shall be
set to the identifier for the next header as defined in IETF RFC 1700 et
seq.

C8-9: HopLmt: If a global route header is included in a packet, this field
shall be set to the number of hops (i.e. the number of routers that may be
transited) that the packet is permitted to take prior to being discarded.

C8-10: SGID: If a global route header is included in a packet, this field
shall be set to one of the GIDs assigned to the port that will inject the
packet into the fabric.

C8-11: DGID: If a global route header is included in a packet, this field
shall be set to one of the GIDs assigned to the port that is the final
destination of this packet, or to the multicast GID that represents the set of
ports to which the packet is to be delivered.
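
The generation rules in 8.4.1 reduce to a few small predicates. The Python sketch below is one hedged reading of the first two compliance statements and the FlowLabel rules; the function names and the `ordered_flow_id` parameter are hypothetical, and remapping an accidental zero label to 1 is an illustrative choice, not something the specification prescribes.

```python
from typing import Optional


def grh_required(is_multicast: bool, same_subnet: bool,
                 both_ports_on_iba: bool = True) -> bool:
    """C8-1: a GRH is mandatory for multicast, or for inter-subnet packets
    whose injecting and receiving ports are both on IBA subnets."""
    return is_multicast or (not same_subnet and both_ports_on_iba)


def grh_allowed(is_smp: bool) -> bool:
    """o8-1: a GRH may be included in any packet except an SMP."""
    return not is_smp


def flow_label_for(ordered_flow_id: Optional[int]) -> int:
    """FlowLabel rules: zero when unused, otherwise a constant non-zero
    20-bit value for every packet of the same ordered flow."""
    if ordered_flow_id is None:
        return 0
    label = ordered_flow_id & 0xFFFFF
    # Assumption: avoid producing zero for a flow that does use the label.
    return label or 1
```

Because the label is a pure function of the flow identifier, every packet of a given flow carries the same value, and two flows that collide on the same label are still permitted since flow labels may be shared among QPs.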

8.4.2 Global Route Header Modification

This section describes the modifications that may and must be made to
the global route header by IBA routers when forwarding packets between
subnets. Note that modification of these fields implies updating the
packet's variant CRC defined in 7.8.2 Variant CRC (VCRC) - 2 Bytes on
page 163. These changes do not affect the packet's invariant CRC
defined in 7.8.1 Invariant CRC (ICRC) - 4 Bytes on page 161.

C8-12: IPVer: This field shall not be changed by IBA routers.

TClass: This field is used to communicate service level end-to-end, i.e.
across subnets. Routers utilize this field to determine an appropriate SL
for forwarding on the next subnet. This mapping function is not specified
by IBA.

C8-13: The TClass field, if non-zero, shall not be modified by IBA routers.

The use of TClass by routers when it contains zero is not defined by IBA.

FlowLabel: This field may be used to identify a sequence of packets that
must be delivered in order. The use of this field is not required by IBA. If
not used, it is left unchanged. If used, all packets that must be delivered
in order with respect to each other shall be identified by a constant,
non-zero value inserted in this field in each packet.

o8-2: The router may change the value of FlowLabel; however, it must
use the same flow label for all packets that must be delivered in order,
which includes all traffic between any given two QPs.

C8-14: PayLen: IBA routers shall not modify the content of PayLen.

C8-15: NxtHdr: IBA routers shall not modify the content of NxtHdr.

C8-16: HopLmt: IBA routers shall discard packets that contain a value of
one in the HopLmt field. Otherwise, IBA routers shall decrement the
HopLmt field by one.

C8-17: SGID: IBA routers shall not modify the content of SGID.

C8-18: DGID: IBA routers shall not modify the content of DGID.
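
The router behavior described in 8.2.3 and 8.4.2 can be sketched as a single forwarding step. This Python example is illustrative only: the `Packet` type and `route` function are hypothetical, the GIDs are treated as opaque byte strings, and a HopLmt of zero is discarded along with one as an assumption drawn from the HopLmt description in 8.3.6. A real router would also recompute the VCRC after these changes.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Packet:
    slid: int      # LRH source LID
    dlid: int      # LRH destination LID
    hop_lmt: int   # GRH hop limit
    sgid: bytes    # GRH source GID, never modified by routers
    dgid: bytes    # GRH destination GID, never modified by routers


def route(packet: Packet, router_lid: int,
          next_hop_lid: int) -> Optional[Packet]:
    """Forward one packet across a router, or return None to discard it."""
    # The HopLmt rule names the value one; zero is also dropped here
    # as an assumption.
    if packet.hop_lmt <= 1:
        return None
    # The router rewrites SLID and DLID and decrements HopLmt; the other
    # GRH fields carried in this sketch are left untouched.
    return replace(packet, slid=router_lid, dlid=next_hop_lid,
                   hop_lmt=packet.hop_lmt - 1)
```

Keeping `Packet` immutable makes it easy to see that only the three fields named in the call to `replace` differ between the inbound and outbound packet.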

8.4.3 Global Route Header Verification

This section describes the verification that must be performed by the
network layer at the final destination of the packet. This verification assumes
that the packet has passed the verification required of the lower layers to
permit the packet to be presented to the network layer.

C8-19: The network layer shall silently discard, with the exception of
adjusting any applicable management counters specified elsewhere in this
specification, packets that meet any of the following conditions:

• Value of IPVer is not 6.

• The value of DGID does not equal one of the GID values assigned to
the port that received the packet.

If none of the above conditions require discard of the packet by the
network layer, the network layer presents the packet to the transport layer.
Note that the other layers, including the transport layer, may require
additional verification of fields within the global route header.
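
The two discard conditions of the verification rule above can be expressed as a short check. The following Python sketch is a simplified illustration: the function name is hypothetical, GIDs are treated as opaque byte strings, and management counter updates are omitted.

```python
from typing import Iterable


def network_layer_accepts(ip_ver: int, dgid: bytes,
                          port_gids: Iterable[bytes]) -> bool:
    """Return True if the packet may be presented to the transport layer."""
    if ip_ver != 6:
        return False          # first condition: IPVer is not 6
    if dgid not in set(port_gids):
        return False          # second condition: DGID not assigned to port
    return True
```

A False result corresponds to a silent discard; the caller would be the place to adjust any applicable management counters.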

Chapter 9: Transport Layer

9.1 Overview

Each IBA packet contains a transport header. The transport header
contains the information required by the endnode to complete the specified
operation, e.g. delivery of data payload to the appropriate entity within the
endnode such as a thread or IO controller. This chapter defines the
transport services used by IBA.

The client of an IBA channel adapter communicates with the transport
layer by manipulating a "queue pair" (QP) made up of a Send work queue
and a Receive work queue. For a host platform, the client of the transport
layer is the Verbs software layer. The client posts buffers or commands to
these queues and hardware transfers data from or into the buffers.
Throughout this chapter, a QP that initiates an operation, i.e. injects a
message into the fabric, is referred to as the requester and the QP that
receives the message is referred to as the responder.

When a QP is created, it is associated with one of five transport service
types. The transport service describes the degree of reliability and to what
and how the QP transfers data.

The five transport service types are:

1) Reliable Connection

2) Reliable Datagram

3) Unreliable Datagram

4) Unreliable Connection

5) Raw IPv6 Datagram & Raw Ethertype Datagram

Table 256 Channel Adapter Attributes on page 799 lists which of these
services are required for Host Channel Adapters and Target Channel
Adapters. Table 34 below compares several key attributes of these five
transport service types.

Reliable transport services use a combination of sequence numbers and
acknowledgment messages (ACK / NAK) to verify packet delivery order,
prevent duplicate packets and out-of-sequence packets from being
processed, and to detect missing packets. Upon error detection, e.g. a
missing packet, the missing packet along with all subsequent packets will
be retransmitted by the requestor. IBA does not support selective packet
retransmission nor the out-of-order reception of packets.

An IBA operation is defined to include a request message and, for reliable
services, its corresponding response. Thus, the request message is
generated by a requester, and a response, if one exists, is generated by the
responder.

[Figure 65: the requester sends a request message to the responder, and
the responder returns a response (acknowledge); together these form one
IBA operation.]

Figure 65 IBA Operation



A request message consists of one or more IBA packets. The packets of 
a request message are called request packets. A response, except for an 
RDMA READ Response, consists of exactly one packet. A response is 
also called an acknowledge. The response packet acknowledges receipt 
of one or more packets. The response may acknowledge the receipt of 
packets that comprise anywhere from a portion of a request message to 
multiple request messages. 

Unreliable transport services do not use acknowledgment messages.
They do however generate sequence numbers. This allows a responder
to detect out-of-sequence or missing packets and to perform local
recovery processing. The specifics of any recovery processing for unreliable
datagrams are outside the scope of the IBA specification.

InfiniBand™ Trade Association
Page 197
Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067
InfiniBand™ Architecture Release 1.0
Volume 1 - General Specifications
Transport Layer
October 24, 2000
FINAL

Table 34 Comparison of IBA Transport Service Types

Columns: Reliable Connection (RC), Reliable Datagram (RD), Unreliable
Datagram (UD), Unreliable Connection (UC), Raw Datagram (both IPv6 &
ethertype) (Raw).

Scalability (M processes on N processor nodes communicating with all
processes on all nodes)
    RC:  M^2*N QPs required on each processor node, per CA.
    RD:  M QPs required on each processor node, per CA.
    UD:  M QPs required on each processor node, per CA.
    UC:  M^2*N QPs required on each processor node, per CA.
    Raw: 1 QP required on each end node, per CA.

Reliability: Corrupt data detected
    All: Yes

Reliability: Data delivery guarantee
    RC, RD:      Data delivered exactly once
    UD, UC, Raw: No guarantees

Reliability: Data order guaranteed
    RC:  Yes, per connection
    RD:  Yes, packets from any one source QP are ordered to multiple
         destination QPs.
    UD:  No
    UC:  Unordered and duplicate packets are detected.
    Raw: No

Reliability: Data loss detected
    RC, RD: Yes
    UD:     No
    UC:     Yes
    Raw:    No

Reliability: Error recovery
    RC, RD: Reliable. Errors are detected at both the requestor and the
            responder. The requestor can transparently recover from errors
            (retransmission, alternate path, etc.) without any involvement
            of the client application. QP processing is halted only if the
            destination is inoperable or all fabric paths between the
            channel adapters have failed.
    UD:     Unreliable. Packets with some types of errors may not be
            delivered. Neither source nor destination QPs are informed of
            dropped packets.
    UC:     Unreliable. Packets with errors, including sequence errors, are
            detected and may be logged by the responder. The requestor is
            not informed.
    Raw:    Unreliable. Packets with errors are not delivered. The
            requestor and responder are not informed of dropped packets.

RDMA and ATOMIC Operations
    RC:  Yes
    RD:  Yes
    UD:  No
    UC:  Yes: RDMA WRITEs. No: RDMA READs & ATOMICs
    Raw: No

Bind Memory Window
    RC:  Yes
    RD:  Yes
    UD:  No
    UC:  Yes
    Raw: No

IBA Unreliable Multicast Support
    RC:  No
    RD:  No
    UD:  Yes
    UC:  No
    Raw: No

Raw Multicast
    RC:  No
    RD:  No
    UD:  No
    UC:  No
    Raw: Yes

Message Size
    RC, RD: Message size 0 to 2^31 bytes. Smaller max size may be
            negotiated by Connection Management. A message may consist of
            multiple packets.
    UD:     Single PMTU packet datagrams - 0 to 4096 bytes.
    UC:     Message size 0 to 2^31 bytes. Smaller max size may be
            negotiated by Connection Management. A message may consist of
            multiple packets.
    Raw:    Single PMTU packet datagrams - 0 to 4096 bytes.

Connection Oriented?
    RC:  Connected. The client connects the local QP to one and only one
         remote QP. No other traffic flows over these QPs.
    RD:  Connectionless. Appears connectionless to the client - uses one or
         more End-to-End contexts per CA to provide reliability service.
    UD:  Connectionless. No prior connection is needed for communication.
    UC:  Connected. The client connects the local QP to one and only one
         remote QP. No other traffic flows over these QPs.
    Raw: Connectionless. No prior connection is needed for communication.

9.2 Base Transport Header

Base Transport Header (BTH) contains fields always present for all IBA
transport services - it is not present in Raw packets. The presence of BTH
is indicated by the Link Next Header (LRH:LNH) field.

C9-1: All IBA transport services shall include a Base Transport Header
(i.e. it is not present in Raw packets).

Figure 66 Base Transport Header (BTH)

Bytes | Bits 31-24                  | Bits 23-16          | Bits 15-8 | Bits 7-0
0-3   | OpCode                      | SE | M | Pad | TVer | Partition Key
4-7   | Reserved 8 (masked in ICRC) | Destination QP
8-11  | A | Reserved 7              | PSN - Packet Sequence Number
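
The 12-byte layout in Figure 66 can be exercised with a small pack/unpack pair. The Python sketch below is an illustration, not a reference encoder: the `BTH` class and its attribute names are hypothetical, and it assumes big-endian byte order with SE in the most significant bit of the second byte, followed by M, the 2-bit pad count and the 4-bit TVer. The Resv8 and Resv7 fields are transmitted as zero and ignored on receive.

```python
import struct
from dataclasses import dataclass


@dataclass
class BTH:
    opcode: int    # 8 bits
    se: int        # 1 bit, Solicited Event
    m: int         # 1 bit, MigReq
    pad_cnt: int   # 2 bits
    tver: int      # 4 bits
    p_key: int     # 16 bits
    dest_qp: int   # 24 bits
    ack_req: int   # 1 bit
    psn: int       # 24 bits

    def pack(self) -> bytes:
        flags = ((self.se & 1) << 7) | ((self.m & 1) << 6) \
            | ((self.pad_cnt & 3) << 4) | (self.tver & 0xF)
        word1 = self.dest_qp & 0xFFFFFF                 # Resv8 sent as 0
        word2 = ((self.ack_req & 1) << 31) | (self.psn & 0xFFFFFF)  # Resv7 = 0
        return struct.pack(">BBHII", self.opcode, flags, self.p_key,
                           word1, word2)

    @classmethod
    def unpack(cls, data: bytes) -> "BTH":
        opcode, flags, p_key, word1, word2 = struct.unpack(">BBHII", data[:12])
        return cls(opcode, flags >> 7, (flags >> 6) & 1, (flags >> 4) & 3,
                   flags & 0xF, p_key,
                   word1 & 0xFFFFFF,       # ignore Resv8 on receive
                   word2 >> 31,
                   word2 & 0xFFFFFF)       # ignore Resv7 on receive
```

Masking the reserved bits on unpack mirrors the "ignored on receive" wording, so a header with non-zero reserved bits still decodes to the same field values.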



9.2.1 Operation Code (Opcode) 

The OpCode field defines the interpretation of the remaining header and
payload bytes. The OpCode list definition is shown in Table 35 OpCode
field on page 200.

C9-2: Table 35 shall be used to define the OpCode parameter in the BTH
as well as the headers and payload that follow the BTH.

Table 35 OpCode field

Columns: Code[4-0] | Description | Packet Contents following the Base
Transport Header (a). Rows are grouped by Code[7-5].

Code[7-5] = 000: Reliable Connection (RC)
    00000       | SEND First                     | PayLd
    00001       | SEND Middle                    | PayLd
    00010       | SEND Last                      | PayLd
    00011       | SEND Last with Immediate       | ImmDt, PayLd
    00100       | SEND Only                      | PayLd
    00101       | SEND Only with Immediate       | ImmDt, PayLd
    00110       | RDMA WRITE First               | RETH, PayLd
    00111       | RDMA WRITE Middle              | PayLd
    01000       | RDMA WRITE Last                | PayLd
    01001       | RDMA WRITE Last with Immediate | ImmDt, PayLd
    01010       | RDMA WRITE Only                | RETH, PayLd
    01011       | RDMA WRITE Only with Immediate | RETH, ImmDt, PayLd
    01100       | RDMA READ Request              | RETH
    01101       | RDMA READ response First       | AETH, PayLd
    01110       | RDMA READ response Middle      | PayLd
    01111       | RDMA READ response Last        | AETH, PayLd
    10000       | RDMA READ response Only        | AETH, PayLd
    10001       | Acknowledge                    | AETH
    10010       | ATOMIC Acknowledge             | AETH, AtomicAckETH
    10011       | CmpSwap                        | AtomicETH
    10100       | FetchAdd                       | AtomicETH
    10101-11111 | Reserved                       | undefined

Code[7-5] = 001: Unreliable Connection (UC)
    00000       | SEND First                     | PayLd
    00001       | SEND Middle                    | PayLd
    00010       | SEND Last                      | PayLd
    00011       | SEND Last with Immediate       | ImmDt, PayLd
    00100       | SEND Only                      | PayLd
    00101       | SEND Only with Immediate       | ImmDt, PayLd
    00110       | RDMA WRITE First               | RETH, PayLd
    00111       | RDMA WRITE Middle              | PayLd
    01000       | RDMA WRITE Last                | PayLd
    01001       | RDMA WRITE Last with Immediate | ImmDt, PayLd
    01010       | RDMA WRITE Only                | RETH, PayLd
    01011       | RDMA WRITE Only with Immediate | RETH, ImmDt, PayLd
    01100-11111 | Reserved                       | undefined

Code[7-5] = 010: Reliable Datagram (RD)
    00000       | SEND First                     | RDETH, DETH, PayLd
    00001       | SEND Middle                    | RDETH, DETH, PayLd
    00010       | SEND Last                      | RDETH, DETH, PayLd
    00011       | SEND Last with Immediate       | RDETH, DETH, ImmDt, PayLd
    00100       | SEND Only                      | RDETH, DETH, PayLd
    00101       | SEND Only with Immediate       | RDETH, DETH, ImmDt, PayLd
    00110       | RDMA WRITE First               | RDETH, DETH, RETH, PayLd
    00111       | RDMA WRITE Middle              | RDETH, DETH, PayLd
    01000       | RDMA WRITE Last                | RDETH, DETH, PayLd
    01001       | RDMA WRITE Last with Immediate | RDETH, DETH, ImmDt, PayLd
    01010       | RDMA WRITE Only                | RDETH, DETH, RETH, PayLd
    01011       | RDMA WRITE Only with Immediate | RDETH, DETH, RETH, ImmDt, PayLd
    01100       | RDMA READ Request              | RDETH, DETH, RETH
    01101       | RDMA READ response First       | RDETH, AETH, PayLd
    01110       | RDMA READ response Middle      | RDETH, PayLd
    01111       | RDMA READ response Last        | RDETH, AETH, PayLd
    10000       | RDMA READ response Only        | RDETH, AETH, PayLd
    10001       | Acknowledge                    | RDETH, AETH
    10010       | ATOMIC Acknowledge             | RDETH, AETH, AtomicAckETH
    10011       | CmpSwap                        | RDETH, DETH, AtomicETH
    10100       | FetchAdd                       | RDETH, DETH, AtomicETH
    10101-11111 | Reserved                       | undefined

Code[7-5] = 011: Unreliable Datagram (UD)
    00000-00011 | Reserved                       | undefined
    00100       | SEND only                      | DETH, PayLd
    00101       | SEND only with Immediate       | DETH, ImmDt, PayLd
    00110-11111 | Reserved                       | undefined

Code[7-5] = 100-101
    00000-11111 | Reserved                       | undefined

Code[7-5] = 110-111
    00000-11111 | Manufacturer Specific OpCodes  | undefined

a. All Opcodes have the ICRC and VCRC attached.
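
Table 35 splits the 8-bit OpCode into a 3-bit transport selector and a 5-bit operation code, which suggests a simple decoder. The Python sketch below is one possible reading of the table; the names `TRANSPORTS`, `OPERATIONS`, `VALID_CODES` and `decode_opcode` are hypothetical, and the operation names are shared across transports for brevity.

```python
from typing import Optional, Tuple

TRANSPORTS = {0b000: "RC", 0b001: "UC", 0b010: "RD", 0b011: "UD"}

# Operation names indexed by Code[4-0], values 00000 through 10100.
OPERATIONS = [
    "SEND First", "SEND Middle", "SEND Last", "SEND Last with Immediate",
    "SEND Only", "SEND Only with Immediate", "RDMA WRITE First",
    "RDMA WRITE Middle", "RDMA WRITE Last", "RDMA WRITE Last with Immediate",
    "RDMA WRITE Only", "RDMA WRITE Only with Immediate", "RDMA READ Request",
    "RDMA READ response First", "RDMA READ response Middle",
    "RDMA READ response Last", "RDMA READ response Only", "Acknowledge",
    "ATOMIC Acknowledge", "CmpSwap", "FetchAdd",
]

# Code[4-0] values that Table 35 defines for each transport service.
VALID_CODES = {
    "RC": range(0, 21),
    "UC": range(0, 12),
    "RD": range(0, 21),
    "UD": range(4, 6),
}


def decode_opcode(opcode: int) -> Tuple[str, Optional[str]]:
    """Split a BTH OpCode into (transport or block name, operation name)."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("OpCode is an 8-bit field")
    group, code = opcode >> 5, opcode & 0x1F
    if group >= 0b110:
        return ("Manufacturer Specific", None)
    if group >= 0b100:
        return ("Reserved", None)
    transport = TRANSPORTS[group]
    if code in VALID_CODES[transport]:
        return (transport, OPERATIONS[code])
    return (transport, "Reserved")
```

For instance, 0x04 has Code[7-5] = 000 and Code[4-0] = 00100, which Table 35 lists as SEND Only on a Reliable Connection.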

9.2.2 Reserved Transport Function Opcodes

For future expansion of its transport layer, IBA provides Reserved and
Manufacturer Defined BTH OpCodes. Two blocks of undefined OpCodes
are specified: one for future revisions of the IBA and one block for
manufacturer specific functions. Manufacturer Defined opcodes should not be
used between devices until the devices are clearly identified as supporting
those opcodes.

9.2.3 Solicited Event (SE) - 1 bit

The requester sets this bit to 1 to indicate that the responder shall invoke
the CQ event handler. Additional operational guidelines:

• A CQ event handler must be configured for the target CQ.

• The SE bit should only be set in the last or only packet of a
SEND, SEND with Immediate, or RDMA WRITE with Immediate.

• The CQ event handler is invoked after a completion event is written to
the CQ.

The SE bit is not considered a part of packet header validation, i.e. receipt
of a packet with this bit set that does not meet the invocation requirements
will not result in a NAK being generated.

C9-3: For an HCA, if an inbound request packet has the Solicited Event
bit in the BTH set to 1 and the additional SE operational guidelines are
valid, it shall invoke the CQ event handler.

o9-1: For a TCA supporting Solicited Events, if an inbound request packet
has the Solicited Event bit in the BTH set to 1 and the additional SE
operational guidelines are valid, it shall invoke the CQ event handler.

C9-4: The responder shall not consider the SE bit in the BTH part of the
packet header validation.

9.2.4 MigReq (M) - 1 Bit

Used to communicate migration state. If set to one, indicates the
connection or EE context has been migrated; if set to zero, it means there is no
change in the current migration state. See Automatic Path Migration
within Chapter 17: Channel Adapters on page 790.

9.2.5 Pad Count (PadCnt) - 2 bits

Packet payloads are sent as a multiple of 4-byte quantities. Pad count
indicates the number of pad bytes - 0 to 3 - that are appended to the packet
payload. Pads are used to "stretch" the payload (payloads may be zero or
more bytes in length) to be a multiple of 4 bytes.

9.2.6 Transport Header Version (TVer) - 4 bits

Specifies the version of the IBA Transport used for this packet. This
version applies to all of the transport fields including the BTH, extended
header and the invariant CRC - this field is set to 0x0. If a receiver does
not support the Transport Version specified then the packet is discarded.

C9-5: Requesters and responders using IBA transports shall generate
IBA transport packets with BTH:TVer = 0x0.
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9.2.7 Partition Key (P_Key) - 16 bits 

P_Key identifies the partition that the destination QP (RC, UC, UD) or EE 
Context (RD) is a member. 

9.2.8 Destination QP (DestQP) - 24 bits 

This field specifies the destination queue pair (QP) identifier. 

9.2.9 Reserve 8 (Resv8) - 8 bits 

Reserved (variant) - 8 bits. Transmitted as 0, ignored on receive. This field 
is not included in the invariant CRC. 

C9-6: When generating a packet, the sender shall set the Resv8 field to 
zero. The receiver shall ignore this field. 



9.2.10 AckReq (A) - 1 Bit 



Requests responder to schedule an acknowledgment on the associated 
QP. 



9.2.11 Reserve 7 (Resv7) - 7 bits

Transmitted as 0, ignored on receive. This field is included in the invariant 
CRC. 

C9-7: When generating a packet, the sender shall set the Resv7 field to 
zero. The receiver shall ignore this field. 

9.2.12 Packet Sequence Number (PSN) - 24 bits 

This field is used to identify the position of a packet within a sequence of
packets. All IBA requesters shall generate a monotonically increasing
(modulo 2^24) PSN when originating a packet. Depending upon the transport
service type and/or implementation requirements, a responder may
validate the PSN to detect missing packets.
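
As an illustration only, the BTH fields described above can be packed into
12 bytes as in the Python sketch below. The function names are hypothetical.
The 8-bit OpCode in the first byte and the wire order (OpCode; SE, M, PadCnt,
TVer; P_Key; Resv8; DestQP; A, Resv7; PSN) are assumptions, since the OpCode
field and the BTH figure are not part of this section. The sketch also shows
the pad count calculation and the modulo 2^24 PSN increment.

```python
import struct

PSN_MODULUS = 1 << 24  # PSN is a 24-bit field


def pad_count(payload_len: int) -> int:
    """Pad bytes (0 to 3) needed to stretch the payload to a multiple of 4."""
    return (-payload_len) % 4


def next_psn(psn: int) -> int:
    """Monotonically increasing PSN, wrapping modulo 2^24."""
    return (psn + 1) % PSN_MODULUS


def pack_bth(opcode, se, m, pad_cnt, p_key, dest_qp, ack_req, psn, tver=0):
    """Pack the BTH fields into 12 bytes (assumed layout, big-endian)."""
    flags = (se & 1) << 7 | (m & 1) << 6 | (pad_cnt & 3) << 4 | (tver & 0xF)
    word1 = dest_qp & 0xFFFFFF                      # Resv8 transmitted as 0
    word2 = (ack_req & 1) << 31 | (psn & 0xFFFFFF)  # Resv7 transmitted as 0
    return struct.pack(">BBHII", opcode & 0xFF, flags, p_key & 0xFFFF,
                       word1, word2)


def unpack_bth(data: bytes) -> dict:
    """Inverse of pack_bth; the reserved fields are ignored on receive."""
    opcode, flags, p_key, word1, word2 = struct.unpack(">BBHII", data[:12])
    return {
        "opcode": opcode,
        "se": flags >> 7,
        "m": (flags >> 6) & 1,
        "pad_cnt": (flags >> 4) & 3,
        "tver": flags & 0xF,
        "p_key": p_key,
        "dest_qp": word1 & 0xFFFFFF,
        "ack_req": word2 >> 31,
        "psn": word2 & 0xFFFFFF,
    }
```

For example, a 13 byte payload needs pad_count(13) pad bytes to reach 16
bytes, and next_psn(0xFFFFFF) wraps back to zero. The opcode value passed to
pack_bth is left to the caller because opcode encodings are not listed here.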

9.3 Extended Transport Headers 

9.3.1 Reliable Datagram Extended Transport Header (RDETH) - 4 Bytes 

Reliable Datagram Extended Transport Header (RDETH) contains the 
End-to-End Context identifier. 

Figure 67 Reliable Datagram Extended Transport Header (RDETH)

9.3.1.1 Reserve - 8 bits

o9-2: If a CA implements Reliable Datagram functionality, then when gen- 
erating a packet, the sender shall set this field to 0x0. The receiver shall 
ignore this field. 

9.3.1.2 End-to-End (EE) Context - 24 bits 

This field indicates the End-to-End (EE) Context used for this packet. EE 
context is a unique endnode identifier used to multiplex / demultiplex reli- 
able datagram packets between any two end nodes. The EE-Context pro- 
vides a context for reliable transfer state similar to that used for reliable 
connection. 

9.3.2 Datagram Extended Transport Header (DETH) - 8 Bytes 

Datagram Extended Transport Header (DETH) contains the additional 
transport fields for reliable and unreliable datagram service. 

Figure 68 Datagram Extended Transport Header (DETH)

Bytes 0-3     Q_Key
Bytes 4-7     Reserve (8 bits), Source QP (24 bits)

9.3.2.1 Q_Key - 32 bits

This field is required to authorize access to the destination queue. The
responder compares this field with the destination QP's Q_Key.

9.3.2.2 Reserve - 8 bits

C9-8: When generating a packet, the sender shall set this field to 0x0. The
receiver shall ignore this field.

9.3.2.3 Source QP (SrcQP) - 24 bits 

This field specifies the source queue pair (QP) identifier. This is used as 
the destination QP for response packets.

9.3.3 RDMA Extended Transport Header (RETH) - 16 Bytes

RDMA Extended Transport Header (RETH) contains the additional trans- 
port fields for RDMA operations. 

Figure 69 RDMA Extended Transport Header (RETH)

Bytes 0-3     Virtual Address (63-32)
Bytes 4-7     Virtual Address (31-0)
Bytes 8-11    R_Key
Bytes 12-15   DMA Length

9.3.3.1 Virtual Address (VA) - 64 bits 

Start address of buffer. RDMA VA may start on any byte boundary. 



9.3.3.2 R_Key - 32 bits

R_Keys have the following properties:

• An R_Key acts as a protection key to access the specified memory
address and range for a given operation, i.e. it is a protection
mechanism to ensure proper access to the target memory. The
responder correlates the R_Key to the local protection mechanisms
to validate the requester's access rights.

• An R_Key must be exported to the requester. This process (which also
includes the export of the starting virtual address and memory
size, i.e. length) is outside the scope of this section.

• Access rights are granted for any combination of RDMA READ,
RDMA WRITE, and ATOMICs - including none and all.

• Each Memory Region or Window has a single valid R_Key at any
given moment. A virtually contiguous range of memory locations
can have multiple Regions or Windows associated with it
concurrently, each with an associated R_Key.

• An R_Key can be exported to multiple remote responders.

• R_Keys are used only for RDMA and ATOMIC Operations. The
R_Key is contained within the packet header.

A responder that supports RDMA and/or ATOMIC Operations shall verify
the R_Key, the associated access rights, and the specified virtual
address. The responder must also perform bounds checking (i.e. verify that
the length of the data being referenced does not cross the associated
memory start and end addresses). Any violation must result in the packet
being discarded and, for reliable services, the generation of a NAK.

9.3.3.3 DMA Length (DMALen) - 32 bits

This field indicates the length, in bytes, of the remote DMA operation.

C9-9: For an HCA performing RDMA operations, the minimum length
specified in the DMALen field is 0; the maximum length is 2^31.

o9-3: If a TCA implements RDMA functionality, the minimum length
specified in the DMALen field is 0; the maximum length is 2^31.
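
The responder checks described for the R_Key, access rights, bounds, and
DMA length can be sketched as follows. This is a minimal Python illustration
under stated assumptions, not the mechanism of any particular channel
adapter: the MemoryRegion record, the Access flags, and the
validate_rdma_request function are hypothetical names, the table of
registered regions is modeled as a plain dictionary keyed by R_Key, and the
maximum DMA length is assumed to be 2^31 bytes.

```python
from dataclasses import dataclass
from enum import Flag, auto

MAX_DMA_LEN = 1 << 31  # assumed upper bound for the DMALen field


class Access(Flag):
    NONE = 0
    REMOTE_READ = auto()
    REMOTE_WRITE = auto()
    REMOTE_ATOMIC = auto()


@dataclass(frozen=True)
class MemoryRegion:
    r_key: int      # 32-bit protection key
    start: int      # starting virtual address of the region
    length: int     # size of the region in bytes
    access: Access  # granted remote access rights


def validate_rdma_request(regions, r_key, va, dma_len, needed):
    """Return True if the request may proceed.

    A False result means the packet is discarded and, for reliable
    services, a NAK is generated.
    """
    region = regions.get(r_key)
    if region is None:
        return False                      # unknown R_Key
    if needed not in region.access:
        return False                      # access right not granted
    if not 0 <= dma_len <= MAX_DMA_LEN:
        return False                      # length outside 0..2^31
    end = region.start + region.length
    if va < region.start or va + dma_len > end:
        return False                      # bounds check failed
    return True
```

A zero length request inside the region passes all checks in this sketch,
which matches the minimum DMALen of 0 stated above.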



9.3.4 ATOMIC Extended Transport Header (AtomicETH) - 28 Bytes

ATOMIC Extended Transport Header (AtomicETH) contains the additional
transport fields for ATOMIC Request operations.

Figure 70 ATOMIC Extended Transport Header (AtomicETH)

9.3.4.1 Virtual Address (VA) - 64 bits

Start address of buffer.

9.3.4.2 R_Key - 32 bits

R_Key used to verify remote access to the specified virtual address. See
9.3.3.2 R_Key - 32 bits on page 205.



9.3.4.3 Swap (Add) Data (SwapDt) - 64 bits 

The data operand used in ATOMIC Operations. In a CmpSwap operation 
this field is swapped into the addressed buffer if the CmpDt matched the 
existing buffer contents. In a Fetch Add operation this field is added to the 
contents of the addressed buffer. 

9.3.4.4 Compare Data (CmpDt) - 64 bits 

The data operand used in the Compare portion of the CmpSwap ATOMIC
Operation.

9.3.5 ACK Extended Transport Header (AETH) - 4 Bytes

ACK Extended Transport Header (AETH) contains the additional trans- 
port fields for ACK packets. The ACK Extended Transport header is in- 
cluded in all ACK and the first and last packet of RDMA READ Response 
messages. 

Figure 71 Acknowledge Extended Transport Header (AETH)

9.3.5.1 Syndrome

This field indicates if this is an ACK or NAK. If the packet is an ACK and
the QP is associated with Reliable Connection transport service, the
syndrome also provides the Limit Sequence Number (LSN) - see 9.7.7.2
End-to-End (Message Level) Flow Control on page 296. If the packet is a
NAK, it indicates the error code. For RNR NAK, this field indicates the
responder's requested timer to be used before retransmitting the request.

9.3.5.2 Message Sequence Number (MSN) 

Monotonically increasing (modulo 2^24) sequence number of the last
message completed at the responder. This field is used to optimize
completion processing at the requester.
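
A minimal Python sketch of the 4 byte AETH follows. It assumes that the
Syndrome occupies the first 8 bits and the MSN the remaining 24 bits, which
is inferred from the 4 byte header size and the modulo 2^24 wrap rather than
stated in this section. The function names are hypothetical and the
syndrome encoding itself is not modeled.

```python
import struct

MSN_MODULUS = 1 << 24  # MSN wraps modulo 2^24


def pack_aeth(syndrome: int, msn: int) -> bytes:
    """Pack Syndrome (assumed 8 bits) and MSN (24 bits), big-endian."""
    return struct.pack(">I", (syndrome & 0xFF) << 24 | (msn & 0xFFFFFF))


def unpack_aeth(data: bytes):
    """Return (syndrome, msn) from the first 4 bytes of data."""
    (word,) = struct.unpack(">I", data[:4])
    return word >> 24, word & 0xFFFFFF


def next_msn(msn: int) -> int:
    """MSN after the responder completes one more message."""
    return (msn + 1) % MSN_MODULUS
```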

9.3.5.3 ATOMIC Acknowledge Extended Transport Header (AtomicAckETH) - 8 Bytes

ATOMIC Acknowledge Extended Transport Header (AtomicAckETH) contains
the additional transport fields for ATOMIC response operations.

Figure 72 ATOMIC Acknowledge Extended Transport Header (AtomicAckETH)

9.3.5.4 Original Remote Data (OrigRemDt) - 64 bits

The data result from an ATOMIC Operation. This is the initial contents
read from the remote memory buffer.

9.3.6 Immediate Extended Transport Header (ImmDt) - 4 Bytes

Immediate Data (ImmDt) contains data that is placed in the receive Com- 
pletion Queue Element (CQE). The ImmDt is only allowed in SEND or 
RDMA WRITE packets with Immediate Data. 

Figure 73 Immediate Extended Transport Header (ImmDt)

9.4 Transport Functions

A QP provides the transport layer's client (e.g. the verbs layer in an HCA)
with a specific transport service. Different transport services have various
reliability levels for connected and connectionless communication. This
section describes the basic functions used with each of the transport
services. Additional transport sections go into more depth on the specifics
of response packets, ordering, error recovery, etc. This section provides
the high level view of the functions and how they work.

Not all the functions are available for each transport service, as described
in Table 36 below. The Raw Datagram transport service does not use the
IBA defined transport functions. Instead, Raw Datagram packets transfer
data that is part of some other, non-IBA protocol.

Table 36 Transport Functions Supported for Specific Transport Services

Transport           Reliable     Unreliable     Reliable     Unreliable     Raw
Function            Connection   Connection     Datagram     Datagram       Datagram

SEND                supported    supported      supported    supported      not applicable
RDMA WRITE          supported    supported      supported    not supported  not applicable
RDMA READ           supported    not supported  supported    not supported  not applicable
ATOMIC Operations   optional     not supported  optional     not supported  not applicable
                    support                     support

9.4.1 SEND Operation

The SEND Operation is sometimes referred to as a Push operation or as
having channel semantics. Both terms refer to how the SW client of the
transport service views the movement of data. With a SEND operation the
initiator of the data transfer pushes data to the remote QP. The initiator
doesn't know where the data is going on the remote node. The remote
node's Channel Adapter places the data into the next available receive
buffer for that QP. On an HCA, the receive buffer is pointed to by the WQE
at the head of the QP's receive queue.

The SEND Operation is referred to as having channel semantics because
it moves data much like a mainframe I/O channel - the data is tagged with
a discriminator (for IBA the discriminator is the destination LID and QP
number) and the destination chooses where to place the data based on
the discriminator.

A SEND Operation moves a single message. For the RC, RD, and UC
transport services this message may be longer than a single packet. A
message may range in size from zero bytes to 2^31 bytes.

C9-10: The size of a SEND Operation, as generated by a requester, shall
be between zero and 2^31 bytes (inclusive).

C9-11: For RC and UC transport services in an HCA, a request message
greater than PMTU in length shall be segmented into PMTU-sized
segments for transmission via multiple packets. Similarly, an HCA responder
shall reassemble such packets back into a single message.

o9-4: For RD transport services in an HCA, a request message greater
than PMTU in length shall be segmented into PMTU-sized segments for
transmission via multiple packets. Similarly, an HCA responder shall
reassemble such packets back into a single message.

o9-5: For RC, UC and RD transport services in a TCA, a request message
greater than PMTU in length shall be segmented into PMTU-sized
segments for transmission via multiple packets. Similarly, a TCA responder
shall reassemble such packets back into a single message.

C9-12: For the Unreliable Datagram transport service, a SEND Operation
shall consist only of single packet messages (i.e. the message data
payload is limited to a maximum of the PMTU between the requester and the
responder, i.e. 256, 512, 1024, 2048, or 4096 bytes).

A SEND Operation can, at the discretion of the client, include 4 bytes of
Immediate data with each send message. If included, the Immediate data
is contained within an additional header field (Immediate Extended
Transport Header or ImmDt) on the last packet of the SEND Operation.
For example, Figure 74 below shows a SEND Operation of 700 bytes
requiring 3 SEND packets (assuming a 256 byte PMTU).

Packet   BTH OpCode (a)
#1       "SEND First"
#2       "SEND Middle"
#3       "SEND Last" or "SEND Last with Immediate"

a. The BTH OpCode determines if the RDETH, DETH, and ImmDt headers are
present.

A 700 byte SEND Operation uses 3 packets, assuming a 256 byte PMTU.
Acknowledgment packets, used for reliable transport services, are not shown.

Field    Name
LRH      Local Route Header
GRH      Global Route Header
BTH      Base Transport Header
RDETH    Reliable Datagram Extended Transport Header (a)
DETH     Datagram Extended Transport Header (b)
ImmDt    Immediate Extended Transport Header
ICRC     Invariant CRC
VCRC     Variant CRC

a. Present only for the Reliable Datagram transport service
b. Present only for Reliable Datagram and Unreliable Datagram transport
service

Figure 74 SEND Operation Example

There are several things to note from the above figure:

• The BTH OpCode field determines the start and end of the SEND
message.

   • If the SEND message is less than or equal to the PMTU, then
   the BTH OpCode "SEND Only" or "SEND Only with Immediate" is used.

   • If the SEND message is for a length of zero, then the BTH
   OpCode "SEND Only" or "SEND Only with Immediate" is
   used. In this case, there is no Data Payload field, but all other
   fields are as shown.

   • If the SEND message is greater than the PMTU, then the
   BTH OpCode of the first packet is "SEND First" and the BTH
   OpCode of the last packet is "SEND Last" or "SEND Last with
   Immediate".

   • If the SEND message is greater than twice the PMTU, then
   the packets between the first and last use the BTH OpCode
   "SEND Middle".

   • Every packet in a message that doesn't have the opcode
   SEND Only, SEND Only with Immediate, SEND Last, or
   SEND Last with Immediate shall have a data field of PMTU
   length.

• The responder node (the destination of the SEND Operation)
does not know the final length of the SEND message until the last
packet with the "SEND Last" or "SEND Last with Immediate"
OpCode arrives.

• The Packet Sequence Number field is used by the responder to
detect out-of-order or missing packets.

• If the entire message is not a multiple of the PMTU, then the initial
packets of the message carry a full PMTU number of bytes and
the final packet carries the remainder as a partial payload.

• For a given requesting node's QP, once a multi-packet SEND
Operation is started, no other request packets may be generated
until the "SEND Last" or "SEND Last with Immediate" packet.

C9-13: A multi-packet message shall not be interleaved with other
operations on the same SEND Queue.

• Not all SEND messages carry Immediate data. If they do, a
special header is included in the last or only packet of the message.
The presence of the header is indicated by a special "SEND Last
with Immediate Data" or "SEND Only with Immediate Data"
OpCode in the BTH.

• For an HCA, there is no alignment requirement for the source or
destination buffers of a SEND message. For buffers within a TCA,
any alignment requirement is implementation specific.
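
The opcode selection and payload sizing rules listed above can be sketched
as a small Python function. This is an illustrative sketch only: the
function name segment_send and its return shape are hypothetical, opcodes
are represented by their names rather than their encodings, and headers and
CRCs are not modeled.

```python
def segment_send(message_len: int, pmtu: int, immediate: bool = False):
    """Return a list of (BTH opcode name, payload length) for one SEND."""
    suffix = " with Immediate" if immediate else ""

    # Messages no larger than the PMTU, including zero length messages,
    # use a single "SEND Only" packet.
    if message_len <= pmtu:
        return [("SEND Only" + suffix, message_len)]

    # Initial packets carry a full PMTU; the final packet carries the
    # remainder as a partial payload when the length is not a multiple.
    full, remainder = divmod(message_len, pmtu)
    sizes = [pmtu] * full + ([remainder] if remainder else [])

    packets = [("SEND First", sizes[0])]
    packets += [("SEND Middle", n) for n in sizes[1:-1]]
    packets.append(("SEND Last" + suffix, sizes[-1]))
    return packets
```

Calling segment_send(700, 256) yields three packets carrying 256, 256, and
188 bytes, matching the 700 byte example with a 256 byte PMTU. A message of
exactly twice the PMTU produces only a First and a Last packet, since
"SEND Middle" is used only when the message is greater than twice the PMTU.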

The verbs chapter explains how the upper level SW client of an HCA uses
a work request to post a buffer that is in turn segmented and sent as
packets across the fabric. The same chapter also describes how the
destination node posts a receive buffer into which the destination HCA
reassembles the data. SEND messages initiated by a TCA use an
implementation specific mechanism to create (and respond to) SEND
packets.

C9-14: When generating a packet for a SEND operation, the requester
shall include at least these headers and fields in every packet of the
request: LRH, BTH, Data Payload, ICRC, VCRC.

C9-15: When generating a response to a SEND operation, the responder
shall include at least these headers and fields in the response: LRH, BTH,
AETH, ICRC, VCRC.


InfiniBand™ Trade Association

Page 211

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067

InfiniBand™ Architecture Release 1.0 Transport Layer October 24, 2000

Volume 1 - General Specifications FINAL

9.4.2 RDMA WRITE Operation

The RDMA WRITE Operation is used by the requesting node to write into
the virtual address space of a destination node. The message may be
between zero and 2^31 bytes (inclusive) and is written to a contiguous range
of the destination QP's virtual address space (not necessarily a
contiguous range of physical memory).

C9-16: For an HCA requester performing RDMA WRITE operations, the
length of an RDMA WRITE message, as reflected in the RETH:DMALen
field, shall be between zero and 2^31 bytes (inclusive).

o9-6: If a TCA requester implements RDMA WRITE functionality, the
length of an RDMA WRITE message, as reflected in the RETH:DMALen
field, shall be between zero and 2^31 bytes (inclusive).

Before allowing incoming RDMA WRITEs, the destination node first
allocates a memory range for access by the destination's QP (or group of
QPs). A destination's channel adapter associates a 32-bit R_Key with this
memory region or window. For an HCA, the verbs layer refers to this as
registering a memory region - see 10.6 Memory Management on page 399.
TCAs use an implementation-specific mechanism to allocate and manage
R_Keys that is outside the scope of the IBA specification.

The destination communicates the virtual address, length, and R_Key to
any other host it wishes to have access to the memory region. The
communication of address and R_Key is done by the client upper level
protocol - the exchange is outside the scope of the IBA. For example, an
application program might embed the address, length, and R_Key into a
private data structure that it in turn pushes to other application programs
using the SEND Operation.

C9-17: As with SEND Operations, an HCA requester shall segment an
RDMA WRITE message larger than the PMTU into multiple packets.

o9-7: If a TCA requester implements RDMA WRITE functionality, it shall
segment an RDMA WRITE message larger than the PMTU into multiple
packets.

If specified by the verbs layer, Immediate data is included in the last
packet of an RDMA WRITE message. The Immediate data is not written
to the target virtual address range, but is passed to the client after the last
RDMA WRITE packet is successfully processed. For example, on an HCA
the immediate data is placed on the completion queue.
For example, Figure 75 below shows a 700 byte RDMA WRITE (on a path
with a 256B PMTU).

Packet   BTH OpCode (a)
#1       "RDMA WRITE First"
#2       "RDMA WRITE Middle"
#3       "RDMA WRITE Last" or "RDMA WRITE Last with Immediate"

a. The BTH OpCode field determines if the RDETH, DETH, and ImmDt
headers are present.

A 700 byte RDMA WRITE Operation uses 3 packets, assuming a 256 byte
PMTU. Acknowledgment packets, used for reliable transport services, are
not shown.

Field    Name
LRH      Local Route Header
GRH      Global Route Header
BTH      Base Transport Header
RDETH    Reliable Datagram Extended Transport Header (a)
DETH     Datagram Extended Transport Header (a)
RETH     RDMA Extended Transport Header
ImmDt    Immediate Extended Transport Header
ICRC     Invariant CRC
VCRC     Variant CRC

a. Present only for the Reliable Datagram transport service

Figure 75 RDMA WRITE Operation Example

There are several things to note from the above figure:

• The BTH OpCode field determines the start and end of the RDMA
WRITE message.

   • If the RDMA WRITE request was for a length of zero, then the
   BTH OpCode "RDMA WRITE Only" or "RDMA WRITE Only
   with Immediate" is used. In this case, there is no Data Payload
   field, but all other fields are as shown.

   • If the RDMA WRITE message is less than or equal to the PMTU,
   then the BTH OpCode "RDMA WRITE Only" or "RDMA
   WRITE Only with Immediate" is used.

   • If the RDMA WRITE message is greater than the PMTU, then
   the BTH OpCode of the first packet is "RDMA WRITE First"
   and the BTH OpCode of the last packet is "RDMA WRITE
   Last" or "RDMA WRITE Last with Immediate".

   • If the RDMA WRITE message is greater than twice the PMTU,
   then the packets between the first and last use the BTH
   OpCode "RDMA WRITE Middle".

   • Every packet in an RDMA WRITE message that doesn't have
   the opcode RDMA WRITE Only, RDMA WRITE Only with
   Immediate, RDMA WRITE Last, or RDMA WRITE Last with
   Immediate has a data field of PMTU length.

• The RETH header is present in the first (or only) packet of the
message. It contains the virtual address of the destination buffer
as well as the R_Key and message length fields.

• The Packet Sequence Number field is used by the responder to
detect out-of-order or missing packets.

• If the entire message is not a multiple of the PMTU, then the initial
packets of the message carry a full PMTU number of bytes and
the final packet carries the remainder in a partial payload.

• For a given requesting node's QP, once a multi-packet RDMA
WRITE operation is started, no other request packets may be
generated until the "RDMA WRITE Last" or "RDMA WRITE Last with
Immediate Data" packet is sent.

C9-18: For an HCA RDMA WRITE request, a multi-packet message shall
not be interleaved with other operations on the same SEND Queue.

o9-8: If a TCA requester implements RDMA WRITE functionality, then for
an RDMA WRITE request, a multi-packet message shall not be
interleaved with other operations on the same SEND Queue.

• Not all RDMA WRITE messages carry Immediate data. If an
RDMA WRITE does, a special header is included in the last (or
only) packet of the message. The presence of the header is
indicated by a special "RDMA WRITE Last with Immediate Data" or
"RDMA WRITE Only with Immediate Data" OpCode in the BTH.

• For an HCA, there is no alignment requirement for the source or
destination buffers of an RDMA WRITE message. For buffers
within a TCA, any alignment requirement is implementation
specific.

C9-19: When generating an RDMA WRITE Request, an HCA requester
shall include at least the following headers and fields in each request
packet: LRH, BTH, Data Payload, ICRC, VCRC. The first (or only) packet
of the request shall also include the RETH.

o9-9: If a TCA requester implements RDMA WRITE functionality, it shall
behave as follows. When generating an RDMA WRITE Request, a TCA
requester shall include at least the following headers and fields in each
request packet: LRH, BTH, Data Payload, ICRC, VCRC. The first (or only)
packet of the request shall also include the RETH.

C9-20: When generating an RDMA WRITE Response, an HCA responder
shall include at least the following headers and fields in each response
packet: LRH, BTH, AETH, ICRC, VCRC.

o9-10: If a TCA responder implements RDMA WRITE functionality, then
when generating an RDMA WRITE Response, a TCA responder shall
include at least the following headers and fields in each response packet:
LRH, BTH, AETH, ICRC, VCRC.
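
The per-packet header rules for RDMA WRITE requests can be sketched in
Python as shown below. The sketch is illustrative and rests on several
assumptions: the function name is hypothetical, a locally routed Reliable
Connection is assumed (so GRH, RDETH, and DETH are omitted), and the
relative order of RETH and ImmDt in the returned list is an assumption
rather than something stated in this section.

```python
def rdma_write_packet_headers(message_len: int, pmtu: int,
                              immediate: bool = False):
    """Return a list of (BTH opcode name, [headers and fields]) per packet."""
    # Ceiling division; a zero length message still uses one packet.
    n_packets = max(1, -(-message_len // pmtu))
    layouts = []
    for i in range(n_packets):
        first = i == 0
        last = i == n_packets - 1
        if n_packets == 1:
            name = "RDMA WRITE Only"
        elif first:
            name = "RDMA WRITE First"
        elif last:
            name = "RDMA WRITE Last"
        else:
            name = "RDMA WRITE Middle"
        if last and immediate:
            name += " with Immediate"

        fields = ["LRH", "BTH"]
        if first:
            fields.append("RETH")          # first (or only) packet only
        if last and immediate:
            fields.append("ImmDt")         # last (or only) packet only
        if message_len > 0:
            fields.append("Data Payload")  # absent for zero length writes
        fields += ["ICRC", "VCRC"]
        layouts.append((name, fields))
    return layouts
```

For the 700 byte example with a 256 byte PMTU this returns three layouts, of
which only the first contains the RETH.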

9.4.3 RDMA READ Operation

RDMA READ Operations are similar to RDMA WRITE Operations. They
allow the requesting node to read a virtually contiguous block of memory
on a remote node. As with RDMA WRITEs, the responding node first
allows the requesting node permission to access its memory. The
responder passes to the requester a virtual address, length, and R_Key to
use in the RDMA READ request packet.

A single RDMA READ request can read from zero to 2^31 bytes (inclusive)
of data.

C9-21: For an HCA responding to an RDMA READ request, if the
requested data size is greater than the PMTU, the responder shall segment
the data into PMTU size data segments for transmission as multiple
RDMA READ Response packets. The data is reassembled in the
requesting node's memory.

o9-11: If a TCA responder implements RDMA READ functionality, and the
requested data size is greater than the PMTU, the responder shall
segment the data into PMTU size data segments for transmission as multiple
RDMA READ Response packets. The data is reassembled in the
requesting node's memory.

C9-22: For an HCA requester using RDMA operations, the length of the
requested RDMA READ data, as reflected in the RETH:DMALen field,
shall be between zero and 2^31 bytes (inclusive).

o9-12: If a TCA requester implements RDMA READ functionality, then the
length of the requested RDMA READ data, as reflected in the
RETH:DMALen field, shall be between zero and 2^31 bytes (inclusive).

The following example in Figure 76 shows a 700 byte RDMA READ
operation (on a path with a 256B PMTU).

[Figure 76: a ladder diagram showing the single RDMA READ Request
packet initiated by the requester node. In this example, the destination
node segments the data into three response packets.]

A 700 byte RDMA READ Operation has 3 response packets, assuming a
256 byte PMTU.

a. Present only for the Reliable Datagram transport service



Figure 76 RDMA READ Operation Example

There are several items to note in the previous figure:

• A single request packet will result in multiple read response
packets if the read length is greater than the PMTU.

• The BTH OpCode field identifies the packet as an RDMA READ
Request or Response and determines if any of the extended
transport headers are present.

• The BTH OpCode field determines the start and end of the RDMA
READ Acknowledgment message.

   • If the RDMA READ request message requested a zero byte
   transfer, then the BTH OpCode "RDMA READ Response
   Only" is used. All other fields remain as shown.

   • If the RDMA READ Acknowledgment message is less than or
   equal to the PMTU, then the BTH OpCode "RDMA READ
   Response Only" is used.

   • If the RDMA READ message is greater than the PMTU, then
   the BTH OpCode of the first packet is "RDMA READ
   Response First" and of the last packet "RDMA READ Response
   Last".

   • Every RDMA READ Response First or RDMA READ Response
   Middle packet has a data field of PMTU length.

• If the entire message is not a multiple of the PMTU, then
the initial packets of the response message carry a full PMTU
number of bytes and the final packet carries a partial payload.

• The Packet Sequence Number field is used to detect out-of-order
or missing response packets.

• After initiating an RDMA READ Request packet, the requesting
node may send out additional request packets without waiting for
the response packets to return. See section 9.7.3.1 Requester
Side - Generating PSN on page 248 for an explanation of how the
PSN is determined for subsequent request packets.

• The maximum number of RDMA READ Requests for a particular
QP that can be outstanding at any one time is negotiated at
connection establishment time. A responder may restrict the
connection to as few as one outstanding RDMA READ request per QP. If
ATOMIC Operations are supported, the number of outstanding
requests negotiated at connection establishment time includes
both ATOMIC Operation requests and RDMA READ requests.

• RDMA READ packets never carry Immediate data.



• RDMA READ Requests are retried if the requester did not receive the
proper response.

   • Retried RDMA READ Requests need not start at the same
   address nor have the same length as the original RDMA READ. The
   retried request may only reread those portions that were not
   successfully responded to the first time.

   • The responder validates the R_Key and RDMA READ virtual
   address for the retried request.

   • The PSN of the retried RDMA READ must be in the duplicate
   PSN region. See Section 9.7.1 Packet Sequence Numbers (PSN)
   on page 240.

   • The PSN of the retried RDMA READ request need not be the
   same as the PSN of the original RDMA READ request. Any
   retried request must correspond exactly to a subset of the original
   RDMA READ request in such a manner that all potential
   duplicate response packets must have identical payload data and
   PSNs regardless of whether it is a response to the original
   request or a retried request.

• For an HCA, there is no alignment requirement for the source or
destination buffers of an RDMA READ message. For buffers
within a TCA, any alignment requirement is implementation specific.

C9-23: When generating an RDMA READ Request, an HCA requester
shall include at least the following headers and fields in its request packet:
LRH, BTH, RETH, ICRC, VCRC.

o9-13: If a TCA requester implements RDMA operations, then it shall
conform to the preceding HCA requester compliance statement.

C9-24: When generating an RDMA READ Response, an HCA responder
shall include at least the following headers and fields in each response
packet: LRH, BTH, Data Payload, ICRC, VCRC. If the response packet
BTH:OpCode is "RDMA READ Response First", "RDMA READ Response
Last", or "RDMA READ Response Only", the packet shall also include an
AETH. If the response packet BTH:OpCode is "RDMA READ Response
Middle", an AETH shall not be included.

o9-14: If a TCA responder implements RDMA operations, then it shall
conform to the preceding HCA responder compliance statement.
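
The responder side segmentation of an RDMA READ, together with the rule
on which response packets carry an AETH, can be sketched in Python as
follows. The function name and the tuple layout are hypothetical, and
opcodes are represented by name only.

```python
def rdma_read_responses(read_len: int, pmtu: int):
    """Return (BTH opcode name, payload length, has_aeth) per response."""
    if read_len <= pmtu:
        sizes = [read_len]               # includes the zero byte transfer
    else:
        full, remainder = divmod(read_len, pmtu)
        sizes = [pmtu] * full + ([remainder] if remainder else [])

    responses = []
    for i, size in enumerate(sizes):
        if len(sizes) == 1:
            opcode = "RDMA READ Response Only"
        elif i == 0:
            opcode = "RDMA READ Response First"
        elif i == len(sizes) - 1:
            opcode = "RDMA READ Response Last"
        else:
            opcode = "RDMA READ Response Middle"
        # First, Last, and Only responses include an AETH; Middle does not.
        has_aeth = opcode != "RDMA READ Response Middle"
        responses.append((opcode, size, has_aeth))
    return responses
```

For a 700 byte read over a 256 byte PMTU the sketch produces First (256
bytes, with AETH), Middle (256 bytes, without AETH), and Last (188 bytes,
with AETH).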

9.4.4 ATOMIC Operations

ATOMIC Operations execute a 64-bit operation at a specified address on
a remote node. The operations atomically read, modify and write the
destination address and guarantee that operations on this address by other
QPs on the same CA do not occur between the read and the write. The
scope of the atomicity guarantee may optionally extend to other CPUs
and HCAs.

ATOMIC Operations use the same remote memory addressing mechanism
as RDMA READs and Writes. The virtual address specified in the
request packet is in the address space of the remote QP that the ATOMIC
Operation has targeted.

ATOMIC Operations consist of two packet types, the "ATOMIC Command"
request packet and the "ATOMIC Acknowledge" response packet.

1) ATOMIC Operations are only supported by the Reliable Connection
and Reliable Datagram transport services.

2) ATOMIC Operations do not support Immediate data.

3) ATOMIC Operations support is strongly recommended to be provided
strictly in hardware.¹

4) The virtual address in the ATOMIC Command Request packet shall
be naturally aligned to an 8 byte boundary. The responding CA
checks this and returns an Invalid Request NAK if it is not naturally
aligned.

IBA defines the following ATOMIC Operations:

FetchAdd (Fetch and Add)

The FetchAdd ATOMIC Operation tells the responder to read a 64-bit
buffer value at a naturally aligned virtual address in the responder's
memory, perform an unsigned² add using the 64-bit Add Data field in
the AtomicETH, and write the result (must match the memory type at
the requester) back to the same virtual address. The responder's
operation shall be atomic (i.e. undisturbed by other entities) per section
9.4.4.1 Atomicity Guarantees on page 221.

The FetchAdd operation is performed in the endian format of the
target memory. The original remote data is converted from the endian
format of the target memory for return. The fields are in Big-endian
format on the wire.

The requestor specifies:

• Remote data address and R_Key

• Add data

The acknowledge packet returns:

• Original remote data

After the operation, the responder's memory at the specified virtual
address contains the unsigned sum of the original value and the Add
field in the AtomicETH header. All operations on the requester's
memory are done in the native endian format of the requester.

1. CA implementations may use software assists - this shall be indistinguishable
from a hardware-only implementation; performance must be such that
higher level software applications are not affected.

2. If signed numbers are used, this is the same as using twos complement
arithmetic (the carry is not saved nor reported).
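The FetchAdd semantics described above can be illustrated with a minimal Python sketch. It models the responder's memory as a dictionary keyed by virtual address, which is purely an assumption for illustration; the names `fetch_add` and `MASK64` are hypothetical, and real atomicity (exclusion of other QPs) is not modelled.

```python
MASK64 = (1 << 64) - 1  # results are truncated to 64 bits; the carry is discarded


def fetch_add(memory, vaddr, add_data):
    """Return the original 64-bit value and store the unsigned sum in its place."""
    if vaddr % 8 != 0:
        # The text requires a naturally aligned (8 byte) virtual address.
        raise ValueError("virtual address is not naturally aligned")
    original = memory[vaddr]
    memory[vaddr] = (original + add_data) & MASK64
    return original


memory = {0x100: 5}
returned = fetch_add(memory, 0x100, MASK64)  # MASK64 is -1 in twos complement
```

Here `returned` is the original value 5 and the stored value becomes 4, which shows why an unsigned add with a discarded carry behaves like twos complement arithmetic for signed operands.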

CmpSwap (Compare and Swap)

The CmpSwap ATOMIC Operation tells the responder to read a 64-bit
value at a naturally aligned virtual address in the responder's memory,
compare it with the Compare Data field in the AtomicETH header, and,
if they are equal, write the Swap Data field from the AtomicETH
header into the same virtual address. If they are not equal, the
contents of the responder's memory are not changed. In either case, the
original value read from the virtual address is returned to the
requester. The responder's operation shall be atomic (i.e. undisturbed
by other entities) per section 9.4.4.1 Atomicity Guarantees on page
221.

The requestor specifies:

• Remote data address and R_Key

• Write (swap) data

• Compare data

The acknowledgment packet returns:

• Original remote data

After the operation, the remote data buffer contains the "original
remote value" (if the comparison did not match) or the "Write (swap) data"
(if the comparison did match).

The CmpSwap operation involves three 8 byte data buffers: the compare
data, the write (swap) data, and the original remote data. All three
are transmitted within the request and response packets in big
endian format. All operations on the responder's CA memory are done
in the native endian format of that memory system. All operations on
the requestor's memory are done in the native endian format of the
requestor.

For example, consider a big endian CA initiating a CmpSwap ATOMIC
Operation request packet to a little endian responder. The request
packet contains two big endian data fields: the compare data and the
write (swap) data. The responder converts these data fields to little
endian format and does the compare and swap operation. The original
target data field is converted to big endian format and returned in the
response packet.
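The endian conversion in the example above can be sketched in Python as follows. The 16-byte request payload (swap data followed by compare data) is a simplified, hypothetical layout chosen for illustration and is not the full AtomicETH; the function name and the `native` parameter are likewise assumptions.

```python
import struct


def cmp_swap_responder(memory, offset, request, native="<"):
    """Apply a CmpSwap to `memory` (a bytearray) and return the wire response.

    request: 8-byte swap data + 8-byte compare data, big endian on the wire.
    native:  struct byte-order character of the responder's memory
             ("<" for little endian, ">" for big endian).
    """
    swap, compare = struct.unpack(">QQ", request)          # wire is big endian
    (original,) = struct.unpack_from(native + "Q", memory, offset)
    if original == compare:
        struct.pack_into(native + "Q", memory, offset, swap)
    return struct.pack(">Q", original)                      # always return original


# Little endian responder memory holding 0x1122334455667788.
memory = bytearray(struct.pack("<Q", 0x1122334455667788))
request = struct.pack(">QQ", 0xAABB, 0x1122334455667788)
response = cmp_swap_responder(memory, 0, request)
```

The response always carries the original value in big endian form, whether or not the swap took place.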

[A ladder diagram showing the "ATOMIC Command" Request Packet and the
returning "ATOMIC Acknowledge" response.]

Figure 77 ATOMIC Operation Example

a. Present in ATOMIC Operations only for the Reliable Datagram transport
service.

o9-15: When generating an ATOMIC Operation request, a requester shall
include at least an LRH, a BTH, an AtomicETH, an ICRC and a VCRC.
The sources of data for the LRH, BTH and AtomicETH headers shall be
as shown in Table 60 Packet Fields and Parameters by Service on page
359.

o9-16: When responding to an ATOMIC Operation request, a responder
shall include in its response packet at least an LRH, BTH, AETH,
AtomicAckETH, ICRC and a VCRC.

9.4.4.1 Atomicity Guarantees

o9-17: Atomicity of the read/modify/write on the responder's node by the
ATOMIC Operation shall be assured in the presence of concurrent atomic
accesses by other QPs on the same CA.

o9-18: A CA may optionally assure atomicity of ATOMIC Operations in the
presence of concurrent memory accesses from other CAs, IO devices,
and CPUs. For an HCA, the Verbs layer shall report whether it supports this
enhanced atomicity guarantee.

1) For the requestor, an ATOMIC Operation is considered complete
when the response packet returns.

2) If an RDMA READ work request is posted before an ATOMIC Operation
work request then the atomic may execute its remote memory
operations before the previous RDMA READ has read its data. This
can occur because the responder is allowed to delay execution of the
RDMA READ. Strict ordering can be assured by posting the ATOMIC
Operation work request with the fence modifier. See the description
of the fence modifier under Post Send Request. The fence modifier causes
the requestor to wait until the RDMA READ completes before issuing
the ATOMIC Operation.

3) When a sequence of requests arrives at a QP, the ATOMIC Operation
only accesses memory after prior (non-RDMA READ) requests
access memory and before subsequent requests access memory.
Since the responder takes time to issue the response to the atomic
request, and this response takes more time to reach the requestor
and even more time for the requestor to create a completion queue
entry, requests after the atomic may access the responder's memory
before the requestor writes the completion queue entry for the
ATOMIC Operation request.

4) Each ATOMIC Operation request requires an explicit response and
acknowledge message. An ATOMIC Operation response, with a
properly formed AETH, is considered an acknowledge message.

9.4.4.3 Error Behavior

A responder utilizes vendor specific resources and facilities to implement
ATOMIC Operations and RDMA READs as well as to facilitate retried
ATOMIC requests. It is the responsibility of the requestor to ensure that all
unacknowledged ATOMIC operations and RDMA READs combined do
not overrun the receiver resources. The number of these resources is
negotiated on a per QP basis at connection setup (see 9.4.3 RDMA READ
Operation on page 215 and 9.4.4 ATOMIC Operations on page 218).

The responding node saves the reply data, the PSN, and an indication
that the stored data is from an ATOMIC Operation. This saved data is
used to generate the response for retried ATOMIC Operations. Note that
the execution of an RDMA READ operation may consume the same
resources as are used to save the ATOMIC Operation PSN and reply data.

The information is stored in the destination QP's "connection context".
The "connection context" is the QP context for Reliable Connection
Service. For Reliable Datagram Service, the "connection context" is actually
the "EE context".

Several rules determine when the responder stores the PSN and reply
data of an ATOMIC Operation:

• Only valid, new ATOMIC Operation requests (i.e. all header
checks are valid, the incoming PSN matches the expected PSN,
the R_Key is valid for the data being accessed, and the address
is aligned to a 64-bit boundary) are saved.

• If the responder QP supports multiple outstanding ATOMIC Operations
and RDMA READ Operations, the information on each valid
request is saved in FIFO order. The FIFO depth is the same as
the maximum number of outstanding ATOMIC Operations and
RDMA READ requests negotiated on a per QP basis at connection
setup.

• Repeated ATOMIC or RDMA READ Operations are not saved
again.
The saved ATOMIC and RDMA READ state is shown in the figure below.

Timeline of Responder's State

a. For Reliable Connection Service, the state bits for tracking the
most recent ATOMIC and RDMA READ Operations are kept in
the per QP State. For Reliable Datagram Service these state bits
are kept in the EE Context instead of the Per QP State.

The ladder diagram shows multiple ATOMIC and RDMA READ Requests. In
this example the responder's QP has agreed at connection setup time that it
can accept up to any combination of 2 outstanding ATOMIC or RDMA READ
Operations. This example shows how the responder maintains state of the
recent ATOMIC and RDMA READ operations. Note also in the example an
RDMA READ preceding an ATOMIC Operation but targeting the same
address may return the value after the ATOMIC executes. Strict ordering is
possible in an HCA by using the "fence" option when posting the ATOMIC
following the RDMA READ. See 11.4.1.1 Post Send Request on page 496.

Figure 78 Responder State Maintained for ATOMIC & RDMA READ Operations

An ATOMIC Operation is guaranteed to execute at most once. If the
ATOMIC Operation does not execute on the destination, it is reported to
the sender (e.g. an R_Key protection fault) with the appropriate NAK
syndrome response.

However, like all operations, a non-recoverable error that occurs after
execution at the responder, but before the response reaches the requester
(e.g. a fatal HCA error), results in the requester not knowing the state of
the responder's memory. This case must be detected and dealt with by
the client or upper layer protocol.

As with all operations, errors could occur on any of the transfers. If the
original "ATOMIC Command" request is lost, or the "ATOMIC Acknowledge"
is lost, the sender will retry using the normal retry procedures. If the
retry fails, it is not certain whether the ATOMIC Operation took place at the
destination, but the connection will be in the Error state.

As with retries of Send and RDMA WRITE operations, if the responding
CA has actually executed the request, it will only acknowledge the request
again, not re-run the ATOMIC Operation. This is necessary since an
ATOMIC Operation is not idempotent. The responder recognizes a retried
ATOMIC Operation and returns the reply data from the original
acknowledgment that was previously stored in the QP (or EE context for Reliable
Datagram service) "hidden state". The responder returns the stored result
of an ATOMIC Operation if the following conditions are met:

• The request is valid (i.e. header and OpCode are valid)

• The request is for an ATOMIC Operation (the responder may
check the ATOMIC Operation OpCode is the same as that of the
stored operation)

• The PSN of the request is in the "duplicate region". See a
description of the PSN space in Section 9.7.1 Packet Sequence
Numbers (PSN) on page 240.

• The PSN matches that of a saved ATOMIC Operation.

A retried ATOMIC Operation that does not meet the above conditions is
discarded by the responder. See Table 58 Responder Error Behavior
Summary on page 349.

When an ATOMIC Operation is retried, the responder does not validate
the R_Key nor does it translate the virtual address in the retried request.
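The save-and-replay behavior described above can be sketched in Python as a bounded FIFO of (PSN, reply data) pairs. This is a simplified model under stated assumptions: the class and function names are hypothetical, the decision that a PSN lies in the duplicate region is passed in as a flag rather than computed, header validation is omitted, and only FetchAdd is modelled.

```python
from collections import deque

MASK64 = (1 << 64) - 1


class SavedAtomicState:
    """FIFO of saved ATOMIC results; depth is the negotiated outstanding limit."""

    def __init__(self, depth):
        self._entries = deque(maxlen=depth)  # oldest entry is dropped when full

    def save(self, psn, reply_data):
        self._entries.append((psn, reply_data))

    def lookup(self, psn):
        for saved_psn, reply_data in self._entries:
            if saved_psn == psn:
                return reply_data
        return None


def respond_fetch_add(state, memory, vaddr, add_data, psn, is_duplicate):
    """Return the reply data, or None when a retried request is discarded."""
    if is_duplicate:
        # Retried request: replay the stored result, never re-execute.
        return state.lookup(psn)
    original = memory[vaddr]
    memory[vaddr] = (original + add_data) & MASK64
    state.save(psn, original)
    return original


memory = {100: 20}
state = SavedAtomicState(depth=2)
first = respond_fetch_add(state, memory, 100, 2, psn=23, is_duplicate=False)
second = respond_fetch_add(state, memory, 100, 3, psn=24, is_duplicate=False)
replayed = respond_fetch_add(state, memory, 100, 2, psn=23, is_duplicate=True)
```

The add values 2 and 3 are illustrative only. After the retry, `replayed` equals the original result of PSN 23 and the memory contents are unchanged, because the operation is not executed a second time.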
The figure below demonstrates a failed ATOMIC Operation response
packet and shows the retried request and the eventual successful
response.

Timeline of Responder's State

         Contents of    Per QP State(a) Tracking Most Recent ATOMIC & RDMA READ Operations
         Memory at      Most Recent                 2nd Most Recent
Time     address 100    Op       PSN   Result       Op       PSN   Result
Time 0   20             na       na    na           na       na    na
Time 1   22             ATOMIC   23    20           na       na    na
Time 2   25             ATOMIC   24    22           ATOMIC   23    20
Time 3   25             ATOMIC   24    22           ATOMIC   23    20
Time 4   25             ATOMIC   24    22           ATOMIC   23    20
Time 5   25             ATOMIC   24    22           ATOMIC   23    20

a. For Reliable Connection Service, the state bits for tracking the most
recent ATOMIC and RDMA READ Operations are kept in the per
QP State. For Reliable Datagram Service these state bits are kept
in the EE Context instead of the Per QP State.

The ladder diagram shows multiple ATOMIC and RDMA READ Requests. In
this example the responder's QP has agreed at connection setup time that it
can accept up to any combination of 2 outstanding ATOMIC or RDMA READ
Operations. This example shows a lost ATOMIC acknowledgment (at Time
2). When the request is retried, the original result value is returned. The
original value is returned even if subsequent operations from the same or a
different QP have modified the target of the ATOMIC Operation.

Figure 79 Retrying ATOMIC Operations

If all retries fail, that implies that the connection is lost, and the error
recovery routines in the requesting CA's driver will inform the local
application.

The size of the operation is always 64 bits. The target must be naturally
aligned (low 3 bits of the virtual address must be zero). An error will be
reported if the R_Key range does not fully enclose the target. If this or
another protection error occurs, it will be reported (NAK_Remote_Access)
but will not result in taking any of the "ATOMIC Operation hidden queue"
resources. That is, if the same request is repeated (same PSN) and the
responding side has subsequently allocated an R_Key range, this new
operation will now succeed.
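The two target checks described above can be sketched in Python as follows. The function name, the string return values and the representation of an R_Key range as a start address and length are assumptions made for illustration; the Invalid Request NAK for misalignment follows the alignment rule stated earlier for the ATOMIC Command Request packet.

```python
ATOMIC_SIZE = 8  # the operation is always 64 bits


def check_atomic_target(vaddr, region_start, region_len):
    """Classify an ATOMIC target address against alignment and R_Key range."""
    if vaddr & 0x7:
        # Low 3 bits must be zero for natural alignment.
        return "NAK_Invalid_Request"
    region_end = region_start + region_len
    if vaddr < region_start or vaddr + ATOMIC_SIZE > region_end:
        # The R_Key range must fully enclose the 8 byte target.
        return "NAK_Remote_Access"
    return "OK"
```

A target that starts inside the region but whose last byte falls outside it is rejected, since the range must enclose the whole 8 byte operand.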
9.4.5 Reserved and Manufacturer Defined Transport Function Opcodes

The IBA has two mechanisms for future expansion of its transport layer:

• Reserved and Manufacturer Defined BTH OpCodes

IBA Transport layer functionality can be expanded by defining new BTH
OpCodes. Two blocks of undefined OpCodes are specified: one for future
revisions of the IBA and one for manufacturer specific functions.

This section defines the rules for ordering of transmission, execution, and
completion for transactions for a given QP:

C9-25: A requester shall transmit request messages in the order that the
Work Queue Elements (WQEs) were posted.

C9-26: For messages that are segmented into PMTU-sized packets, the
data payload shall use the same order as the data segments defined by
the WQE.

Packets from a given source QP to a given destination QP travel on the
same path through the fabric and are received in the same order they
were injected.

C9-27: For reliable services on an HCA, all acknowledge packets shall be
strongly ordered, e.g. all previous RDMA READ responses and ATOMIC
responses shall be injected into the fabric before subsequent SEND,
RDMA WRITE responses, RDMA READ response or ATOMIC Operation
responses.

o9-19: If a TCA responder implements Reliable Connection service, or if
a CA responder implements Reliable Datagram service, all acknowledge
packets shall be strongly ordered. That is, all previous RDMA READ
responses and ATOMIC responses shall be injected into the fabric before
subsequent SEND, RDMA WRITE responses, RDMA READ response or
ATOMIC Operation responses.

C9-28: A responder shall execute SEND requests, RDMA WRITE requests
and ATOMIC Operation requests in the PSN order in which they
are received. If the request is for an unsupported function or service, the
appropriate response (for example, a NAK message, silent discard, or
logging of the error) shall also be generated in the PSN order in which it was
received.

C9-29: The completion at the receiver is in the order sent (applies only to
SENDs and RDMA WRITE with Immediate) and does not imply previous
RDMA READs are complete unless fenced by the requester.
C9-30: A requester shall complete WQEs in the order in which they were
transmitted.

C9-32: All WQEs shall be completed in the order they were posted
independent of their execution order.

Due to the ordering rule guarantees of requests and responses for
reliable services, the requester is allowed to write CQ completion events
upon response receipt.

o9-20: An application shall not depend on the contents of an RDMA
WRITE buffer at the responder until one of the following has occurred:

• Arrival and completion of the last RDMA WRITE request
packet when used with Immediate data.

• Arrival and completion of a subsequent SEND message.

• Update of a memory element by a subsequent ATOMIC operation.

o9-21: An application shall not depend on the contents of an RDMA
READ target buffer at the requestor until the completion of the
corresponding WQE.

C9-33: An application shall not depend on the contents of a receive queue
buffer until the corresponding receive WQE has been completed.

9.6 Packet Transport Header Validation

Packet transport header validation is conducted on each packet that is
passed up to the transport layer from the lower IBA layers. The purpose
is to ascertain that the inbound packet can be associated with a particular
queue pair. If it cannot, the packet is silently discarded. Packet transport
header validation applies only to packets using the IBA transport.

C9-34: The transport layer shall validate the packet headers of all packets
using the IBA transport according to the requirements in this section (9.6
Packet Transport Header Validation on page 228). A packet shall be
deemed to be using the IBA transport if the msb of the LRH:LNH field is
set to 1. If the msb of the LRH:LNH field is set to zero, then the packet is
a raw packet. Raw packets are described in Section 9.8.4 Raw datagrams
on page 334.
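The LNH test in C9-34 reduces to a single bit check. The Python sketch below assumes that LNH has already been extracted from the LRH as a 2-bit integer; that width and the function name are assumptions for illustration and should be confirmed against the LRH field definitions.

```python
def uses_iba_transport(lnh):
    """Return True when the most significant bit of a 2-bit LNH value is set."""
    if not 0 <= lnh <= 0b11:
        raise ValueError("LNH is assumed to be a 2-bit field")
    return bool((lnh >> 1) & 0x1)
```

Packets for which this returns False are raw packets and bypass the transport header validation described in this section.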



C9-35: For each inbound packet using the IBA transport, a CA shall
validate the packet according to the state diagram shown in Figure 80.¹ The
details of the state diagram are discussed in the remainder of this section.
If the packet can be associated with a given queue pair, further validation
is conducted by comparing certain characteristics of the packet with
context information stored with the queue pair (or EE Context, in the case of
reliable datagram service). This level of validation is described in Section
9.7 Reliable Service on page 238 and Section 9.8 Unreliable Service on
page 315.

Throughout this section, the phrase "packet is silently dropped" is used.
The responder, unless otherwise noted, behaves as follows when a silent
drop occurs:

• No acknowledge message is returned.

• No receive WQE is consumed by the responder.

• The errant request packet is not executed.

• Any request packets received prior to the errant request are executed
and completed normally.

• Responder does not update its expected PSN.

• Responder resumes waiting for a valid inbound request packet.

The requester, unless otherwise noted, behaves as follows when a silent
drop occurs:

• No send WQEs are completed as a result of a packet that is silently
dropped.

• No direct action is taken as a result of the silently dropped packet,
although error counters may be incremented or other similar events
may occur.

• The silently dropped packet shall not count for purposes of satisfying
the transport timer.

• The queue state is not changed. In addition, for connected transport
services or reliable datagram, the connection or EE context is not torn
down.



9.6.1 Validating Header Fields 



This section specifies the headers and fields that must be validated by a 
receiver of an inbound packet (either a request or response packet) before 
it can rely on the integrity of the packet. 



1. The LVer field of the LRH is verified in the link layer before a packet is
presented to the transport layer. The ICRC and VCRC headers are also verified
in the link layer before a packet is presented to the transport layer. A packet with
an invalid LVer field, invalid ICRC or invalid VCRC is dropped silently before
reaching the transport layer.

[State diagram; recoverable labels: "(mcast pkt * mcast_check=good)",
"Good packet", "begin execution".]

Figure 80 Packet Header Validation Process
9.6.1.1 BTH Checks

This section describes the fields of the BTH that must be validated for all
incoming packets.

C9-36: The transport shall verify that the version number of the transport
headers is supported by the CA or router. If the CA, switch or router does
not support the indicated version, the packet shall be silently dropped.
The only valid transport version is zero.

tver_check

• good: TVer field of BTH is 0x0

• bad: TVer field of BTH is non-zero

9.6.1.1.2 BTH:Destination QP, Opcode Check

Since the OpCode contained in the BTH of the inbound packet is used to
determine if the selected destination QP is valid, OpCode validation is
combined with validating the destination QP and its current condition.

C9-37: The transport shall verify that the destination QP exists and that
the QP state is valid for receiving the inbound packet.

o9-22: If the CA implements RD service, the transport layer shall verify the
destination QP is valid for the CA or router and that the QP state is valid
for receiving the inbound packet. If the destination QP or its state is
invalid, the response shall depend on the EE Context. If the EE Context is
valid, a NAK-Invalid RD Request must be returned; else if the EE context
is invalid, the packet shall be silently dropped.

o9-23: For CAs which support Unreliable Datagram Multicast, the
destination QP value of 0xFFFFFF shall only be valid if there is at least one
locally managed QP which is configured for IBA Unreliable Datagram
Multicast service.

C9-38: BTH:OpCode[7:5] shall be checked to ensure that the service
requested (RC, UC, RD, UD) is consistent with the configuration of the
destination QP.

The response to an inbound packet which contains either an invalid
destination QP, or whose destination QP is not in a valid state for receiving
the inbound packet, is dependent on the service being requested. This is
determined by examining BTH:OpCode[7:5], which indicates whether the
requested service is RC, RD, UC or UD.
Furthermore, if BTH:OpCode[7:5] indicates that the packet is for RD
service, then the remainder of the OpCode bits must be examined to
determine if the inbound packet is a request or a response packet.

C9-39: If BTH:OpCode[7:5] indicates that the packet is for RC, UC or UD 
services, and the destination QP does not exist, or the destination QP is 
not configured to provide the requested service, or the destination QP 
state is invalid, then the inbound packet is silently dropped. 

C9-40: If BTH:OpCode[7:0] indicates an RD request packet (SEND,
RDMA READ Request, etc.), and the destination QP does not exist, or the
destination QP is not configured to provide RD service, or the destination
QP state is invalid, then a NAK-Invalid RD Request shall be returned. If
BTH:OpCode[7:0] indicates an RD response packet (RDMA READ
Response, Acknowledge, etc.), and the destination QP does not exist, or the
destination QP is not configured to provide RD service, or the destination
QP state is invalid, then the inbound packet shall be silently dropped.
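The dispositions in C9-39, C9-40 and o9-22 for a packet that fails the destination QP check can be summarized in a short Python sketch. The 3-bit service codes in the table below are an assumption used for illustration and should be confirmed against the BTH OpCode definitions; the function name and parameters are hypothetical, and the request/response distinction is passed in rather than decoded from the remaining OpCode bits.

```python
# Assumed mapping of BTH:OpCode[7:5] to transport service (illustrative).
SERVICE_BY_OPCODE_7_5 = {0b000: "RC", 0b001: "UC", 0b010: "RD", 0b011: "UD"}


def dest_qp_failure_action(opcode, is_request, ee_context_valid=True):
    """Return the action taken when the destination QP check fails."""
    service = SERVICE_BY_OPCODE_7_5.get((opcode >> 5) & 0x7)
    if service is None:
        # Reserved and manufacturer defined codes are outside this sketch.
        raise ValueError("service code not covered by this sketch")
    if service == "RD" and is_request and ee_context_valid:
        return "NAK-Invalid RD Request"
    # RC, UC, UD, RD responses, and RD requests with an invalid EE Context.
    return "silent drop"
```

Only an RD request arriving on a valid EE Context produces a NAK; every other combination handled here is dropped silently.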

Table 37 Verification of Destination QP

Error Condition                           Description

Invalid Destination QP identifier         No such QP exists on this CA. If the QP
                                          identifier is the IBA unreliable multicast QP
                                          (0xFFFFFF), there is no QP configured for IBA
                                          unreliable multicast service on this CA.

Incorrect Destination QP Configuration    The destination QP configuration is
                                          inconsistent with the service requested by
                                          BTH:OpCode[7:4].

Request packet: QP is not in              Receive queue is not in a proper state to
Ready-to-Send state, Send-Queue-Drain     accept an inbound request packet.
state, or Ready-to-Receive state.

Acknowledge packet: QP is not in          Send queue is not in a proper state to accept
Send-Queue-Drain state or                 an inbound response packet.
Ready-to-Send state.

destqp_check

• good: destination QP specified in BTH is a valid QP, and it is in
the correct state to receive the packet, and the configuration of
the QP is consistent with the service being requested.

• bad: destination QP specified in BTH does not exist, or is not in
the correct state to receive the packet, or is configured
inconsistently with the service being requested.

9.6.1.1.3 BTH:P_Key

C9-41: If the destination QP is QP0, the BTH:P_Key shall not be checked.

C9-42: If the destination QP is QP1, the BTH:P_Key shall be compared to
the set of P_Keys associated with the port on which the packet arrived. If
the P_Key matches any of the keys associated with the port, it shall be
considered valid.

C9-43: For all destination QPs other than QP0 or QP1, for all transport
services except Reliable Datagram, the P_Key shall be compared with the
P_Key associated with the responder's receive queue. An invalid P_Key
shall cause the request packet to be silently dropped.

o9-24: For Reliable Datagram, the P_Key shall be compared with the
P_Key associated with the responder's EE Context. An invalid P_Key
shall cause the request packet to be silently dropped.

pkey_check

• good: BTH:P_Key matches value associated with recv queue or
EE Context

• bad: BTH:P_Key does not match value associated with recv
queue or EE Context

9.6.1.2 GRH Checks

This section describes the fields of the GRH, if present, that must be
validated.

As specified in Section 9.6.1.5.2 IBA Unreliable Multicast Checks on
page 238, a multicast packet must include a GRH.

9.6.1.2.1 GRH:Next Header

C9-44: If there is a GRH present, the Next Header field of the GRH must
be checked. The value of the Next Header field should be set to (awaiting
IETF decision) as defined by the IETF. Any other value indicates that this
packet does not use the IBA transport, and the packet shall be silently
dropped.

nxthdr_check

• good: GRH:NxtHdr field indicates IBA transport

• bad: GRH:NxtHdr field indicates non-IBA transport

9.6.1.2.2 GRH:IPVers

C9-45: If there is a GRH present, the version field of the GRH shall be
checked. If the version number is anything other than 6, the packet shall
be silently dropped.

ip_vers

• not_v6: invalid GRH version number

• v6: GRH version number is valid
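The scalar checks described so far (tver_check, pkey_check and ip_vers) can be sketched in Python as follows. The function signatures are hypothetical. Two assumptions are made for illustration: P_Key comparison is modelled as plain equality, which may be simpler than the specification's full matching rule, and the GRH version is taken to occupy the upper 4 bits of the first GRH byte. The Next Header check is omitted because the text leaves its value open.

```python
def tver_check(tver):
    """The only valid transport version is zero."""
    return tver == 0


def pkey_check(dest_qp, bth_pkey, port_pkeys, context_pkey):
    """Apply the P_Key rules for QP0, QP1 and all other destination QPs.

    port_pkeys:   set of P_Keys associated with the arrival port (used for QP1).
    context_pkey: P_Key of the receive queue, or of the EE Context for RD.
    """
    if dest_qp == 0:
        return True                      # QP0: P_Key is not checked
    if dest_qp == 1:
        return bth_pkey in port_pkeys    # QP1: any P_Key of the port is valid
    return bth_pkey == context_pkey


def ip_vers_check(first_grh_byte):
    """Assumes the version is the upper 4 bits of the first GRH byte."""
    return (first_grh_byte >> 4) == 6
```

A packet failing any of these checks is silently dropped under the rules above.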
9.6.1.2.3 GRH:SGID, GRH:DGID



Connection Management is responsible for loading the primary SGID and 
DGID in the transport layer. If the given CA supports automatic path mi- 
gration, a set of alternate SGID and DGID are also loaded. Primary and 
alternate GID comparison and actions are per the rules defined in Section 
17.2.8 Automatic Path Migration on page 804 . 

If a GRH is present, the SGID and DGID shall be verified as follows: 

C9-46: If the destination QP is configured for UD transport service, the
SGID shall not be validated at the transport layer. The DGID shall only be
validated if the packet is a valid multicast packet. See 9.6.1.5.2 IBA
Unreliable Multicast Checks on page 238 for a definition of a valid multicast
packet.

C9-47: If the destination QP is configured for RC, UC, or RD transport
services, the SGID and the DGID shall be validated at the transport layer.
For RC and UC, invalid packets must be silently dropped. For RD, a "NAK
Invalid RD Request" must be returned for invalid packets.

The DGID is validated as follows: 

1) If the DGID is set to the Reserved GID, the DGID is invalid. 

2) If the DGID is set to the Loopback GID, the DGID is invalid. 

3) If the DGID's scope indicates a Multicast GID but there is no locally 
associated QP, then the DGID is invalid. 

Following these checks, the DGID is compared against the following. If it 
matches none of these, then the DGID is invalid. 

4) The DGID is compared against the Primary DGID. 

5) The upper 64-bits of the DGID is compared against the default GID 
prefix (0xFE80::0) and the lower 64-bits of the DGID is compared 
with the lower 64-bits of the Primary DGID 

6) If Automatic path migration is supported, the DGID is compared with 
the Alternate DGID. 

7) If Automatic path migration is supported, the upper 64-bits of the 
DGID is compared against the default GID prefix (0xFE80::0) and the 
lower 64-bits of the DGID is compared with the lower 64-bits of the 
Alternate DGID 
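The DGID steps above can be sketched in Python with GIDs held as 128-bit integers. The constants for the Reserved GID (all zeros), the Loopback GID (::1) and the multicast test (top byte 0xFF) are assumptions used for illustration, as are the function names; the sketch follows the listed steps literally for the RC, UC and RD case and does not model the UD multicast rules.

```python
RESERVED_GID = 0                       # assumed: all-zero GID
LOOPBACK_GID = 1                       # assumed: ::1
DEFAULT_PREFIX_HI = 0xFE80 << 48       # upper 64 bits of 0xFE80::0
LOW64 = (1 << 64) - 1


def is_multicast(gid):
    return (gid >> 120) == 0xFF        # assumed multicast indicator


def matches(gid, configured):
    """Exact match, or default prefix plus the configured lower 64 bits."""
    if gid == configured:
        return True
    return (gid >> 64) == DEFAULT_PREFIX_HI and (gid & LOW64) == (configured & LOW64)


def dgid_valid(dgid, primary, alternate=None, multicast_qp_attached=False):
    """alternate is None when automatic path migration is not supported."""
    if dgid in (RESERVED_GID, LOOPBACK_GID):
        return False
    if is_multicast(dgid) and not multicast_qp_attached:
        return False
    if matches(dgid, primary):
        return True
    return alternate is not None and matches(dgid, alternate)
```

The SGID comparison that follows uses the same matching helper against the Primary and Alternate SGID, with its own two exclusion rules in place of the three shown here.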

The SGID is validated as follows: 

1) If the SGID is set to multicast, the SGID is invalid.

2) If the SGID is set to the Loopback GID, the SGID is invalid. 
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9.6.1.3 RDETH CHECKS 



9.6.1.3.1 RDETH:EE Context 



Following these checks, the SGID is compared to the following. If the 
SGID does not match at least one of the following, it is invalid. 

3) The SGID is compared against the Primary SGID 

4) The upper 64-bits of the SGID is compared against the default GID 
prefix (0xFE80::0) and the lower 64-bits of the SGID is compared with 
the lower 64-bits of the Primary SGID 

5) If Automatic path migration is supported, the SGID is compared with 
the Alternate SGID 

6) If Automatic path migration is supported, the upper 64-bits of the 
SGID is compared against the default GID prefix (0xFE80::0) and the 
lower 64-bits of the SGID is compared with the lower 64-bits of the Al- 
ternate SGID 



gid_check 



good 
bad 



GRH SGID and DGID compared successfully 
GRH SGID or DGID is invalid 



The section describes the fields of the RDETH, if present, that must be 
validated for reliable datagram service. 



o9-25: If BTH:OpCode[7:5] indicates RD transport service, the RDETH
shall be validated.

o9-26: The EE Context Identifier shall be verified per the rules in Table 38
Verification of EE Context on page 235. If the EE context is invalid, the
packet must be silently dropped.

Table 38 Verification of EE Context

Error Condition                          Description
---------------------------------------  ---------------------------------------
Invalid Destination EE Context           No such EE Context exists on this CA.
identifier

Request packet: EE Context is not in     EE Context is not in a proper state to
Ready-to-Send state, Send-Queue-Drain    accept an inbound request packet.
state or Ready-to-Receive state.

Acknowledge packet: EE Context is not    EE Context is not in a proper state to
in Ready-to-Send state or                accept an inbound response packet.
Send-Queue-Drain state.



context check 



good: EE Context specified in RDETH is valid
bad: EE Context specified in RDETH is invalid

9.6.1.4 DETH CHECKS

This section describes the fields of the DETH, if present, that must be
checked for datagram service, either reliable or unreliable.

9.6.1.4.1 DETH:Q_Key

C9-48: If the destination QP is QP0, the DETH:Q_Key field shall not be
validated.

C9-49: If the destination QP is QP1, the DETH:Q_Key field shall be
considered valid if it compares successfully to the well-known Q_Key
0x80010000.

C9-50: For all packets received for a queue pair configured for datagram
service, except QP0, the Q_Key shall be checked by the receiver's
receive queue. If the Q_Keys do not match, the responder's behavior
depends on whether the service is unreliable datagram or reliable datagram
and shall be as follows:

Unreliable Datagram: the packet shall be silently dropped.

Reliable Datagram:

• A NAK-Invalid RD Request shall be returned.

• The P_Key used in the NAK may be supplied by the responder's
EE Context or it may be extracted from the request packet being
acknowledged.

• The PSN used in the NAK message is the PSN of the errant
request packet.

• The EE Context's PSN is unchanged; it remains pointing to the
failed request packet.

• The responder resumes waiting for a valid inbound request
packet.

C9-51: The responder must not return an acknowledge message for a 
packet until the Q_Key and the P_Key for the packet have been checked 
by the receive queue. 
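
The Q_Key rules in C9-48 through C9-50 can be summarized in a short sketch. The function name, parameter names and the returned strings are hypothetical and chosen only for illustration; the sketch covers the Q_Key decision alone and does not build the NAK itself.

```python
WELL_KNOWN_QP1_QKEY = 0x80010000


def qkey_action(dest_qp, packet_qkey, stored_qkey, reliable):
    """Return 'accept', 'drop' or 'nak_invalid_rd_request' for a datagram packet."""
    if dest_qp == 0:
        return "accept"  # C9-48: the Q_Key is not validated for QP0
    # C9-49: QP1 is compared against the well-known Q_Key.
    expected = WELL_KNOWN_QP1_QKEY if dest_qp == 1 else stored_qkey
    if packet_qkey == expected:
        return "accept"
    # C9-50: a mismatch is handled according to the datagram service type.
    return "nak_invalid_rd_request" if reliable else "drop"
```

For example, `qkey_action(5, 8, 7, reliable=False)` yields "drop", while the same mismatch on a reliable datagram queue pair yields "nak_invalid_rd_request".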

qkey_check

• good: the Q_Key contained in the BTH matches that associated
with the receive queue's stored Q_Key.

• bad: the Q_Key contained in the BTH does not match that
associated with the receive queue's stored Q_Key.

9.6.1.5 LRH Checks

This section describes the fields of the LRH that must be validated for all
inbound packets.

9.6.1.5.1 LRH:SLID, LRH:DLID

C9-52: The 16-bit fully resolved SLID and DLID contained in the LRH shall
be validated.

C9-53: The DLID shall be validated for all transport services (reliable
connected, unreliable connected, reliable datagram and unreliable
datagram).

C9-54: The SLID shall be validated only for reliable connected, unreliable
connected or reliable datagram service. The SLID shall not be validated
for Unreliable Datagram service.

To be valid, the SLID and the DLID contained in the LRH must compare 
exactly to one of the following: 

1) Permissive LID 

2) Multicast LID (for DLID only) 

3) Primary LID 

4) Alternate LID 

C9-55: The permissive LID shall only be accepted as valid if the
destination queue pair is QP0.

C9-56: If the SLID is a multicast LID, it shall be invalid.

C9-57: In an HCA configured for Reliable Connection or Unreliable
Connection service, if an invalid LID is detected, the packet shall be silently
dropped. For RC and UC service, this check is performed by the send or
receive queue.

o9-27: If a TCA implements Reliable Connection or Unreliable
Connection service and an invalid LID is detected, the packet shall be silently
dropped. For RC and UC service, this check is performed by the send or
receive queue.

o9-28: If a CA implements Reliable Datagram service, and if an invalid
LID is detected, the packet shall be silently dropped. For RD service, this
check is performed by the EE Context.

The primary SLID and DLID are stored in the QP or EE Context. If the
given channel adapter supports transparent migration, an alternate SLID
and DLID are also stored in the QP or EE Context as part of the alternate
path. The choice of whether to compare the inbound SLID and DLID to the
primary or alternate LIDs is a function of the current state of the automatic
path migration state machine and the state of the MigReq bit in the BTH.

lid_check

• good: SLID and DLID contained in the LRH match the SLID
and DLID, respectively, stored in the QP or EE Context.

• bad: SLID or DLID contained in the LRH does not match the
SLID and DLID, respectively, stored in the QP or EE Context.
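
A rough sketch of the LID comparison follows. Several things here are assumptions rather than statements from this section: the permissive LID value (0xFFFF), the multicast LID range (0xC000 to 0xFFFE), and all names. For simplicity the sketch accepts either the primary or the alternate LID, whereas the text above ties that choice to the path migration state machine and the MigReq bit, and it treats any multicast-range DLID as acceptable.

```python
PERMISSIVE_LID = 0xFFFF        # assumed value of the permissive LID
MULTICAST_LID_FIRST = 0xC000   # assumed start of the multicast LID range
MULTICAST_LID_LAST = 0xFFFE    # assumed end of the multicast LID range


def is_multicast_lid(lid):
    return MULTICAST_LID_FIRST <= lid <= MULTICAST_LID_LAST


def _matches_stored(lid, primary, alternate):
    return lid == primary or (alternate is not None and lid == alternate)


def dlid_is_valid(dlid, dest_qp, primary, alternate=None):
    if dlid == PERMISSIVE_LID:
        return dest_qp == 0          # C9-55: permissive LID only for QP0
    if is_multicast_lid(dlid):
        return True                  # multicast LID is allowed for the DLID only
    return _matches_stored(dlid, primary, alternate)


def slid_is_valid(slid, dest_qp, primary, alternate=None, unreliable_datagram=False):
    if unreliable_datagram:
        return True                  # C9-54: SLID is not validated for UD
    if is_multicast_lid(slid):
        return False                 # C9-56: a multicast SLID is invalid
    if slid == PERMISSIVE_LID:
        return dest_qp == 0
    return _matches_stored(slid, primary, alternate)
```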

9.6.1.5.2 IBA Unreliable Multicast Checks

C9-58: A packet is declared to be an IBA unreliable multicast packet if the
destination QP is 0xFFFFFF. To be considered valid, it must have the
following three characteristics: The packet must contain a GRH, the DGID
must be a valid multicast GID, and there must be at least one locally
managed queue pair configured for multicast operation. If any of these
conditions is not true, the packet is not a valid multicast packet and shall be
dropped silently.

In addition to these requirements, the DLID must be a member of a valid
multicast group; however, this check is performed at the link layer and
need not be repeated here.

C9-59: The DGID shall be used to map the inbound packet to a particular
locally managed QP.

multicast_check

• good: a multicast packet meets all the criteria cited above to be a
valid multicast packet.

• bad: a multicast packet does not meet all the criteria cited above
to be a valid multicast packet.

9.7 Reliable Service

Reliable Service provides a guarantee that messages are delivered from
a requester to a responder at most once, in order and without corruption.
Key elements of the reliable service include a protection scheme to
enable detection of corrupted data (CRC), an acknowledgment mechanism
allowing the requester to ascertain that the message had been
successfully delivered, a packet numbering mechanism to detect missing packets
and to allow the requester to correlate responses with requests, and a
timer to allow detection of dropped or missing acknowledgment
messages.

This section addresses the acknowledgment and packet sequence
numbering mechanisms. The CRC mechanism for detecting packet corruption
is not addressed here. Note that CRCs are checked at lower protocol
layers and may result in packets being dropped before they are delivered
to the transport layer. These dropped packets may eventually be detected
at the transport layer as sequence errors.

• Characteristics of reliable service

  • Messages delivered at most once, in order and without corruption
  in the absence of unrecoverable errors.

  • Each message is acknowledged either explicitly or implicitly.

• Types of reliable service

  • Reliable Connection

  • Reliable Datagram

• Reliability mechanisms

  • ACK / NAK protocol

  • Packet Sequence Numbers (PSN)

• Responder considers operation complete when it has:

  • Received a valid "Last" or "Only" OpCode in the BTH,

  • Received all packets comprising the message in proper PSN
  order,

  • Payload has been committed to the local fault zone (SENDS or
  RDMA WRITES),

  • Response has been committed to the wire for RDMA READs or
  ATOMIC Operations,

  • Acknowledge packet for the last packet of the request has been
  committed to the wire (including the appropriate fields for RDMA
  READ response)

• Requester considers the operation complete when:

  • All packets of the response (for RDMA READ or ATOMIC
  Operation) have been received and committed to local memory,

  • Acknowledge message has been received and validated.

C9-60: Before it can consider a WQE completed, the requester must wait
for the necessary response(s) to arrive. If the requester requires an
explicit response such that it can complete a given WQE, then the requester
shall be responsible to take the necessary steps to ensure that the needed
response is forthcoming.

There are several mechanisms available to accomplish this such as:

1) Set the AckReq bit on the last packet of every message thus
guaranteeing that the responder will generate the needed explicit response,

2) Set the AckReq bit on the last packet of the message for which an
explicit response is desired,

3) If the AckReq bit was not set for the request for which an explicit
response was desired, the requester can retry the request (with
AckReq set) thus requiring the responder to return a response,

4) If the AckReq bit was not set for the request for which an explicit
response was desired, the requester can send a NOP command (e.g.
RDMA WRITE request with a length of zero) and set the AckReq bit.

The choice of which of these, or other, strategies to use is implementation
dependent.

9.7.1 Packet Sequence Numbers (PSN)

PSNs are transmitted within the Base Transport Header (BTH) for all
packets. They are used to detect missing or out-of-order packets, and, for
reliable services, to relate a response packet to a given request packet.
Each IBA QP consists of a send queue and a receive queue; likewise, an
EE Context has a send side and a receive side. There is a relationship
between the PSN on a requester's send queue and the PSN on the
responder's corresponding receive queue. Thus, each half of a QP (or EE
Context) maintains an independent PSN; there is no relationship between
the PSNs used on the Send queue and Receive queue of a given queue
pair, or between different connections. This is illustrated in the figure below.

[Figure: Endpoint A and Endpoint B, each with a send queue and a receive
queue, joined by a connection and labeled with request A-B, response A-B,
request B-A and response B-A.]

Request A-B and Response A-B (for reliable service) are related by PSN A-B.
PSN A-B has no relationship to PSN B-A.

Figure 81 Send-Receive Queues Related by PSN

The two conditions that the responder must detect to ensure reliable
operation are duplicate packets and invalid packets.

1) Duplicate Packet. A duplicate packet may be recognized by the
responder if the requester injects a request packet into the fabric more
than once. This occurs when the requester detects a condition for
which the prescribed recovery mechanism is to retry the operation.

There are two primary causes of a timeout condition that may cause
the requester to inject a given request packet into the fabric more than
once:

• A response is late in arriving at the requester either because a
response packet is lost or delayed in the fabric as shown in Figure 82
below, or because the responder experienced a delay in generating a
response, or

• A request packet may be lost or delayed in reaching the responder as
shown in Figure 84.

Regardless of the cause, the responder must be able to determine if
an inbound request is a duplicate request that had been previously
executed (or not) and respond appropriately.



[Figure: the requester sends r1; after a timeout, the requester re-transmits
r1; the responder detects the second r1 as a duplicate request.]

'r' is a request packet, 'a' is an acknowledge packet

Figure 82 Duplicate Request Packets

In the previous figure, the response has been lost or delayed in the
fabric causing the requester to detect a timeout condition and
re-transmit the request. The responder interprets the second arrival of
the request packet as a duplicate request.



[Figure: the requester sends r1; after a timeout, the requester re-sends r1;
the responder executes r1'; the original r1 finally appears.]

'r' is a request packet, 'a' is an acknowledge packet

Figure 83 Ghost Request Packet

A duplicate packet may also be detected by the responder due to a
"ghost" request packet. This occurs when a request packet is delayed
in the fabric long enough to cause a timeout to occur at the requester.
The requester re-sends the original request packet to which the
responder generates the proper acknowledge message. Sometime
later, the original (delayed) packet arrives at the responder which
interprets the late arriving packet as a duplicate request. This may
occur, for example, during automatic path migration.

2) Invalid Request Packet Sequence. This condition occurs when the 
responder believes that one or more request packets have been lost 
in the fabric. This is illustrated in the following figure. 



[Figure: request packets flow from the requester to the responder; the
responder detects a missing request.]

'r' is a request packet, 'a' is an acknowledge packet

Figure 84 Lost Request Packet(s)






InfiniBand™ Trade Association

Page 242

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067



These two conditions must be detected both by the responder (for request
packets) and by the requester (for response packets on reliable services).

The distinction between an invalid packet and a duplicate packet is
important since the requester's actions and the responder's actions are different
for the two cases.

In addition to duplicate packets and invalid packets, there is a third
condition, called a Stale Packet ("TIME WAIT packet"). If a connection to a
responder is torn down and a new connection is established while packets
are in flight, a packet from the old (stale) connection may arrive at the
responder. The responder, in turn, may interpret this stale incoming packet
as a valid packet, when in fact it is a remnant of a previous connection.
There are no transport layer mechanisms to guard against this condition;
it is the responsibility of connection management to avoid re-using QPs
until there is no possibility that a stale packet could arrive at the responder.
This is done by placing the requester and responder QPs in a "Time Wait"
state long enough to ensure that any stale packets left in the fabric have
expired before re-using those QPs.

Duplicate packets are distinguished from invalid packets by the 24-bit
PSN field which is carried in the base transport header, and allows room
for uniquely naming up to 16,777,216 packets.

C9-61: In order to make it possible for the responder to distinguish
duplicate packets from out of order packets, a given send queue shall have a
series of PSNs no greater than 8,388,608 outstanding at any given time.
Therefore, a send queue shall have no more than 8,388,608 packets
outstanding at any given time. This includes the sum of all SEND request
packets plus all RDMA WRITE request packets plus all ATOMIC
Operation request packets plus all expected RDMA READ response packets.

Thus, the PSN space (consisting of a range of 16,777,216 PSNs) is
divided into two regions, each occupying a range of 8,388,608 PSNs, called
the valid region and the invalid region. This is illustrated in Figure 85.



[Figure: the range of Packet Sequence Numbers, 0 to 16,777,215, divided
into an Invalid Region and a Valid Region (range = 8,388,608 PSNs). The
Valid Region contains the Duplicate Region. Markers: PSN of oldest
outstanding request; Responder's Expected PSN.]

A Send Queue or EE Context may have no more than
8,388,608 packets outstanding at any time.

Figure 85 Valid and Invalid PSN Regions

The responder further subdivides the valid region into an Expected PSN
and a Duplicate region. The responder's expected PSN (ePSN) is defined
in Section 9.7.4.1.2 Responder - PSN Verification on page 251, and is
simply described as the PSN that the responder expects to find in the next
new request packet to be received. The duplicate region is therefore
defined to be the entire valid region, except for the single expected PSN.
Simply put, a duplicate PSN is a PSN which the responder has seen and
executed previously and which falls within the valid region.
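
The region test can be sketched with modular arithmetic on the 24-bit PSN. This is an illustration under one stated assumption: the valid region is taken to be the 8,388,608 PSNs ending at (and including) the expected PSN, so that the duplicate region is the 8,388,607 PSNs immediately behind it. The function name and return strings are hypothetical.

```python
PSN_MODULUS = 1 << 24    # 16,777,216 PSNs in total
REGION_SIZE = 1 << 23    # 8,388,608 PSNs per region


def classify_psn(psn, expected_psn):
    """Classify an inbound request PSN relative to the responder's ePSN."""
    # Distance by which psn trails the expected PSN, modulo 2^24.
    behind = (expected_psn - psn) % PSN_MODULUS
    if behind == 0:
        return "expected"
    if behind < REGION_SIZE:
        return "duplicate"   # inside the valid region, already seen
    return "invalid"         # falls in the invalid region
```

Because the comparison is done modulo 2^24, the classification is unaffected when the PSN sequence wraps from 16,777,215 back to zero.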



9.7.1.1 PSN Model for Reliable Service 



C9-62: For an HCA requester using Reliable Connection service, the
requester shall insert a PSN in each packet of each request it generates.
When responding to the request, the responder shall insert a PSN in each
packet of each response it generates.

o9-29: If a TCA implements Reliable Connection service, or if a CA
requester implements Reliable Datagram service, the requester shall insert
a PSN in each packet of each request it generates. When responding to
the request, the responder shall insert a PSN in each packet of each
response it generates.

Except for the special case of RDMA READ responses, there is a 1:1
relationship between the PSN in a request packet and the PSN in the
corresponding response packet.

In the general PSN model, the requester calculates the PSN of the next
request packet to be generated. This calculated PSN is called the Next
PSN. At the time that the requester generates a new request packet, the
"Next PSN" is copied into the BTH and thus becomes the current PSN.
The requester then calculates a new "Next PSN".

In order to detect missing or out of order packets, the responder also
calculates the PSN it expects to find in the next inbound request packet. This
is called the Expected PSN.

Conversely, when generating responses, the responder calculates the
Response PSN to relate the response to a given request. However, due
to acknowledge coalescing as described in 9.7.5.1.2 Coalesced
Acknowledge Messages on page 263, the requester cannot necessarily predict
which one of a range of PSNs may appear in the next response packet.
Therefore, the requester must be prepared to accept any one of a range
of Response PSNs. The range is bounded by the PSN of the oldest
unacknowledged request packet and the expected response PSN of the most
recently sent request packet. The requester evaluates the PSN of an
inbound response packet to ensure that it falls between these two
extremes. This general model is illustrated below.

[Figure: request and response PSNs exchanged between requester and
responder; only the labels "Expected PSN" and "Responder" survive.]

Figure 86 Request/Response PSNs

In the following sections it is usually obvious from the context which of
the four PSNs the text is referring to. Thus, it is common practice to
refer to "PSN", or "expected PSN" or some other variant. In the cases
where the context is not clear, the above expressions are used for clarity.

9.7.2 ACK/NAK PROTOCOL



The ACK/NAK protocol, along with packet sequence numbers, is a
fundamental component of reliable service, and applies to both reliable
connected service and reliable datagram service. This and the following
sections describe the protocol, provide a set of rules governing generation
of ACK and NAK responses, specify the ACK and NAK codes and specify
the requester's required responses when it receives either an ACK or a
NAK response.

The purpose of the ACK/NAK protocol is to allow the requester to
ascertain deterministically if the responder correctly received the request
packet. There are also mechanisms provided to ensure that a complete
message was received correctly. This is accomplished through a
combination of the packet sequence number and packet OpCodes
(first/middle/last/only packet indications).

Since a response packet(s) can get lost in the fabric, the ACK/NAK
protocol requires a requester to implement a timer to detect lost response
packets. The transport timer is also described in this section.

The word "acknowledge" is used consistently throughout this section to
mean either a negative (NAK) or a positive (ACK) acknowledgment. The
generic term "response" is used to describe the acknowledgment returned
by the responder to the requester. A response is carried in one or more
acknowledge packets and may comprise, depending on the original
request message, an ACK packet, a NAK packet, a RDMA READ response
or an ATOMIC Operation response.

• Each request packet received on a reliable service shall be
acknowledged.

• Each RDMA READ request requires an explicit response. A RDMA
READ response, with a properly formed ACK Extended Transport
Header (AETH) is considered a valid response. The ACK Extended
Transport Header appears in the first packet and last packet (or only
packet) of a RDMA READ response. The details are covered below
in Section 9.7.5.1.9 RDMA READ Responses on page 275

• Each ATOMIC Request requires an explicit response. An
acknowledge packet, with a properly formed ACK Extended Transport
Header (AETH) and an ATOMIC ACK Extended Transport Header
(AtomicAckETH) is considered to be a valid response.

• Acknowledges may be coalesced; that is, a single acknowledge
packet can serve as acknowledgment for one or more previous
request packets.

• Acknowledge packets shall be returned in the PSN order in which the
original request packet was received, including RDMA READ
responses.

• A RDMA READ response consists of one or more packets; all other
responses consist of exactly one packet.

C9-63: For an HCA responder using Reliable Connection service, the
responder shall behave as follows. A responder shall acknowledge each
request packet received. A responder shall generate an explicit response
for each RDMA READ request received. A responder shall generate an
explicit response for each ATOMIC Request received. A responder shall
generate response packets in the PSN order in which the original request
packets were received, including RDMA READ responses.

o9-30: If a TCA responder implements Reliable Connection service, or if
a CA responder implements Reliable Datagram service, the responder
shall behave as follows. A responder shall acknowledge each request
packet received. A responder shall generate an explicit response for each
RDMA READ request received. A responder shall generate an explicit
response for each ATOMIC Request received. A responder shall generate
response packets in the PSN order in which the original request packets
were received, including RDMA READ responses.

5 

9.7.3 Requester: Generating Request Packets

This section specifies the requirements placed on a requester as it
generates request packets.

9.7.3.1 Requester Side - Generating PSN

C9-64: For Reliable Connection service in an HCA, the requester must
place a value, called the current PSN, in the BTH:PSN field of every
request packet.

o9-31: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, then the requester must
place a value, called the current PSN, in the BTH:PSN field of every
request packet.

During connection establishment, the transport layer's client programs the
next PSN to any value between zero and 16,777,215. For proper
operation, the initial expected PSN value on the responder side must be loaded
with the same value.

C9-65: For Reliable Connection service in an HCA, the initial PSN, as
programmed by the transport layer's client, is the PSN that shall appear in the
first request packet generated by the requester.

o9-32: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection Service is implemented in a TCA, then the initial PSN, as
programmed by the transport layer's client, is the PSN that shall appear in the
first request packet generated by the requester.

Thereafter, the requester calculates the next PSN. The calculation
depends on the operation being performed (SEND, RDMA READ, etc.) and
the size of the data payload.



With one exception, the requester shall increment the current PSN value
by one for each request packet it generates. The single exception is for
any request packet immediately following a RDMA READ request
message. In this case, the request packet immediately following the RDMA
READ request message shall have a PSN that is one greater than the
PSN of the last expected RDMA READ response packet. In this way, the
requester leaves a "hole" in the PSN sequence of the request packets,
such that all response packets will have monotonically increasing PSNs.

Thus, for RDMA READ Requests:

Let curr_PSN = PSN of a RDMA READ Request

Let next_PSN = PSN of the request following a RDMA READ Request

Let n = the number of expected RDMA READ response packets

Then next_PSN = (curr_PSN + n) modulo 2^24

Table 39 Requester's Calculation of Next PSN

Current Request Packet        PSN for Next Request Packet
----------------------------  ---------------------------------------------
SEND, RDMA WRITE,             current PSN + 1 (modulo 2^24)
ATOMIC Operation

RDMA READ                     current PSN + (number of expected RDMA READ
                              response packets) + 1 (modulo 2^24)

Since the requester knows both the total length of the requested RDMA
READ data and the PMTU between the requester and the responder, and
since there is a requirement that each response packet (except a last or
only packet) be filled to the full PMTU size, the requester can calculate the
total number of expected response packets and thus calculate the PSN of
the request immediately following the RDMA READ request.
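
The calculation described in this paragraph can be sketched as below. The sketch follows the next_PSN = (curr_PSN + n) formula given for RDMA READ requests. The function names, the opcode strings and the treatment of a zero-length read as producing one response packet are assumptions made for illustration.

```python
PSN_MODULUS = 1 << 24


def expected_read_responses(dma_length, pmtu):
    """Number of RDMA READ response packets for dma_length bytes.

    Every packet except the last (or only) one carries a full PMTU.
    Assumption: a zero-length read is still answered with one packet.
    """
    return max(1, -(-dma_length // pmtu))  # ceiling division


def next_psn(curr_psn, opcode, dma_length=0, pmtu=256):
    """PSN of the request packet that follows the current one."""
    if opcode == "RDMA_READ":
        n = expected_read_responses(dma_length, pmtu)
        return (curr_psn + n) % PSN_MODULUS
    # SEND, RDMA WRITE and ATOMIC requests advance the PSN by one.
    return (curr_psn + 1) % PSN_MODULUS
```

For instance, a 1000-byte RDMA READ over a 256-byte PMTU expects four response packets, so a read issued with PSN 10 is followed by a request with PSN 14.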

23 

C9-66: For an HCA requester using Reliable Connection service, the
requester shall behave as follows. For each request packet other than the
packet immediately following an RDMA READ request message, the
requester shall increment the next PSN value by one modulo 2^24. For any
request packet immediately following a RDMA READ request message,
the packet shall have a PSN that is one greater (modulo 2^24) than the PSN
of the last expected RDMA READ response packet.

o9-33: If a TCA requester implements Reliable Connection service, or if a
CA requester implements Reliable Datagram service, the requester shall
behave as follows. For each request packet other than the packet
immediately following an RDMA READ request message, the requester shall
increment the next PSN value by one modulo 2^24. For any request packet
immediately following a RDMA READ request message, the packet shall
have a PSN that is one greater (modulo 2^24) than the PSN of the last
expected RDMA READ response packet.

9.7.3.2 Requester - Special Rules for Reliable Datagram RDD Checking

For reliable datagram service, any given send queue is associated with an
EE Context by a Reliable Datagram Domain (RDD). Each send queue
and EE Context has a single RDD associated with it. Before sending a
request, the EE context must check the RDD of the currently active send
queue. If the send queue's RDD does not match the EE Context's RDD,
the current message transfer is terminated and a timeout condition is
indicated to the send queue.

o9-34: For each request, the requester must confirm that the RDD of the
currently active send queue matches the RDD of the selected EE context.

o9-35: If the send queue's RDD does not match the EE Context's RDD,
the current message transfer is terminated and a timeout condition must
be indicated to the send queue.



9.7.3.3 Requester - Generating Opcodes 



The opcodes generated by a requester must fit into a schedule of opcodes 
as shown below. 

C9-67: A requester must generate packet opcodes which fit within the
schedule of valid OpCode sequences as shown in Table 40 Schedule of
Valid OpCode Sequences on page 250.

Table 40 Schedule of Valid OpCode Sequences

Previous Packet OpCode        Valid OpCodes for Current Packet
----------------------------  ------------------------------------------------
None, e.g., first packet      "First" packet
following connection          "Only" packet
establishment

"First" packet                "Middle" packet (message is 3 or more packets)
                              "Last" packet (message is exactly 2 packets)
                              Type of operation must match the previous OpCode

"Middle" packet               "Middle" packet
                              "Last" packet
                              Type of operation must match the previous OpCode

"Last" packet                 "First" packet (1st packet of a new message)
                              "Only" packet (1st packet of a new single
                              packet msg)

"Only" packet                 "First" packet
                              "Only" packet

C9-68: When generating a request packet, the BTH:OpCode shall be as
specified in Table 35 OpCode field on page 200.
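
The schedule in Table 40 amounts to a small state machine, sketched below. Packets are modeled as (position, operation) pairs such as ("first", "SEND"); this representation and the function name are hypothetical and do not reflect the actual BTH opcode encoding.

```python
# Allowed positions for the current packet, keyed by the previous position.
VALID_NEXT = {
    None: {"first", "only"},        # first packet after connection setup
    "first": {"middle", "last"},
    "middle": {"middle", "last"},
    "last": {"first", "only"},
    "only": {"first", "only"},
}


def sequence_is_valid(packets):
    """Check a sequence of (position, operation) pairs against Table 40."""
    prev_pos, prev_op = None, None
    for pos, op in packets:
        if pos not in VALID_NEXT[prev_pos]:
            return False
        # Within a message, the type of operation must not change.
        if prev_pos in ("first", "middle") and op != prev_op:
            return False
        prev_pos, prev_op = pos, op
    return True
```

A two-packet SEND followed by a single-packet RDMA WRITE passes the check, while a "first" packet followed directly by an "only" packet does not.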



9.7.3.4 Requester - Generating Payloads 



The requester shall generate payload lengths as a function of the opcode
as follows:

C9-69: If the OpCode specifies a "first" or "middle" packet, then the packet
payload length must be a full PMTU size.

C9-70: If the OpCode specifies an "only" packet, then the packet payload
length must be between zero and PMTU bytes in size. Thus, the only way
to create a zero byte length transfer is by use of a single packet message.

C9-72: For an HCA, if the OpCode specifies an RDMA WRITE request,
the length specified in the DMALen field of the RETH shall be no less than
zero, and no greater than 2^31 bytes.

o9-36: If RDMA WRITE is implemented in a TCA and the OpCode
specifies an RDMA WRITE request, the length specified in the DMALen field
of the RETH shall be no less than zero, and no greater than 2^31 bytes.
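
The payload rules above can be expressed as two small checks. This sketch assumes an upper bound of 2^31 bytes for DMALen and uses hypothetical function names. It covers only the "first", "middle" and "only" cases stated in this section and deliberately rejects other positions rather than guessing a rule for them.

```python
MAX_DMA_LEN = 1 << 31  # assumed upper bound for RETH:DMALen, in bytes


def payload_length_ok(position, length, pmtu):
    """Check a packet payload length against C9-69 and C9-70."""
    if position in ("first", "middle"):
        return length == pmtu          # must be a full PMTU
    if position == "only":
        return 0 <= length <= pmtu     # zero-length transfers use "only"
    raise ValueError("no rule stated in this section for position: " + position)


def dma_len_ok(dma_len):
    """Check the RETH:DMALen of an RDMA WRITE request (C9-72, o9-36)."""
    return 0 <= dma_len <= MAX_DMA_LEN
```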

9.7.4 Responder: Receiving Inbound Request Packets

This section describes the process used by a responder when it receives
an inbound request packet.

9.7.4.1 Responder - Inbound Packet Validation

C9-73: For Reliable Connection service in an HCA responder, inbound
request packets shall be validated as shown in Figure 87.

o9-37: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, inbound request packets
shall be validated as shown in Figure 87.

The following sections describe each of the validation checks and the
responder's behavior / response.

9.7.4.1.1 Responder - Special Rules for Reliable Datagram RDD checking

o9-38: For RD within an HCA, when an inbound packet arrives, the receive
queue must test its own RDD value against that of the EE Context over
which the inbound packet arrived. If they do not match, the receive queue
must drop the packet and schedule a NAK-Invalid RD Request. The
P_Key and PSN to be used for returning the NAK shall be supplied by the
EE Context.

9.7.4.1.2 Responder - PSN Verification

C9-74: For Reliable Connection service in an HCA responder, and before
executing the inbound request, the responder shall check the PSN by
comparing the inbound BTH:PSN to the responder's expected PSN. The
PSN shall be checked by the responder's receive queue.

o9-39: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, and before executing the
inbound request, the responder shall check the PSN by comparing the
inbound BTH:PSN to the responder's expected PSN. The PSN shall be
checked by the responder's receive queue.

Figure 87 Inbound Request Packet Validation

For reliable datagram service, the PSN is checked by the responder's EE
Context.

To a large extent, the responder's behavior in responding to a request is
based on an interpretation of the incoming PSN.

Logically, a receive queue or EE Context maintains an expected PSN
(ePSN). This is the PSN that the responder expects to find in the BTH of
the next new request packet it receives. The rules that the responder uses
to calculate its next expected PSN are the same as those used by the
requester when it calculates the PSN value to insert in its next request
packet.

C9-75: For Reliable Connection service in an HCA responder, a
responder shall use the rules given in 9.7.3.1 Requester Side - Generating
PSN on page 248 to calculate its expected PSN.

o9-40: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, a responder shall use the
rules given in 9.7.3.1 Requester Side - Generating PSN on page 248 to
calculate its expected PSN.

The responder's expected PSN is initialized at connection establishment
time by the Connection Manager to any value between zero and
16,777,215. For proper operation, this initial value must match the initial
next PSN value as loaded on the requester.

The initial expected PSN can only be set by the client when the queue is
in the Initialized state. Attempts by the client to set the PSN when it is in
any other state may be ignored by the transport layer.

C9-76: For Reliable Connection service in an HCA responder, the HCA
shall update its expected PSN only when the receive queue (or EE Context)
is in a state such that it is properly conditioned to receive request
packets. For example, the transport does not modify the expected PSN
when the queue pair is in the Initialized state.

o9-41: If Reliable Connection service is implemented in a TCA, the
responder shall update its expected PSN only when the EE Context is in a
state such that it is properly conditioned to receive request packets. For
example, the transport does not modify the expected PSN when the
queue pair is in the Initialized state.

o9-42: If Reliable Datagram service is implemented in a CA, the
responder shall update its expected PSN only when the EE Context is in a
state such that it is properly conditioned to receive request packets.

When compared to its expected PSN, the actual PSN of an inbound
request message may fall into one of three regions: it may be exactly equal
to the responder's expected PSN, it may be logically "less" than the
responder's expected PSN and thus fall into the duplicate region as shown
in Figure 85, or it may fall outside both the valid region and the expected
PSN "region", and thus be invalid.
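
The three-way classification above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the 24-bit PSN width follows from the 0 to 16,777,215 range stated earlier, but the width of the duplicate region (taken here as half of the PSN space) is an assumption, since Figure 85 is not reproduced in this section. The names `classify_psn`, `advance_psn` and `DUPLICATE_WINDOW` are hypothetical.

```python
PSN_MOD = 1 << 24                 # PSNs run from 0 to 16,777,215 and wrap
DUPLICATE_WINDOW = PSN_MOD // 2   # ASSUMPTION: duplicate region is half the space


def classify_psn(psn: int, expected_psn: int) -> str:
    """Classify an inbound PSN as 'expected', 'duplicate' or 'invalid'."""
    if not (0 <= psn < PSN_MOD and 0 <= expected_psn < PSN_MOD):
        raise ValueError("PSN must fit in 24 bits")
    if psn == expected_psn:
        return "expected"
    # How far the inbound PSN lies logically behind the expected PSN.
    behind = (expected_psn - psn) % PSN_MOD
    if behind <= DUPLICATE_WINDOW:
        return "duplicate"
    return "invalid"


def advance_psn(psn: int, step: int = 1) -> int:
    """Advance a PSN by `step` packets, wrapping at 2^24."""
    return (psn + step) % PSN_MOD
```

Comparing PSNs by modular distance rather than with `<` keeps the test correct when the expected PSN has just wrapped past zero.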



Expected (new) Request: An inbound request packet received with a
PSN that exactly matches the responder's expected PSN is a new request
packet.

C9-77: For Reliable Connection service in an HCA responder, a new
request packet shall be validated normally and executed according to the
rules governing order of execution. Once the request has been executed,
a response shall be scheduled as specified in 9.7.5 Responder: Generating
Responses on page 262.

o9-43: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, a new request packet shall
be validated normally and executed according to the rules governing
order of execution. Once the request has been executed, a response shall
be scheduled as specified in 9.7.5 Responder: Generating Responses on
page 262.

Note that it is not required to return a discrete acknowledge packet for
each inbound request packet.

Once a packet with a valid expected PSN has been received, the
responder advances its expected PSN by calculating the new expected
PSN, and slides the valid region window up to reflect the new range of
valid PSNs.

Valid Duplicate Request: A PSN that falls within the valid region, but is
not the expected PSN, is a valid duplicate request packet.

C9-78: For Reliable Connection service in an HCA responder, the
responder shall respond to valid duplicate requests as specified in 9.7.5.1.4
Acknowledging Duplicate Requests on page 267.

o9-44: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, then the responder shall
respond to valid duplicate requests as specified in 9.7.5.1.4 Acknowledging
Duplicate Requests on page 267.



InfiniBand™ Trade Association



Page 254 

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067 



InfiniBand™ Architecture Release 1.0 
Volume 1 - General Specifications 



Transport Layer 



October 24, 2000 
FINAL 



Table 41 summarizes those actions.

Table 41 Summary: Responder Actions for Duplicate PSNs

Duplicate Request Message   Responder Action
-------------------------   ------------------------------------------------
SEND or RDMA WRITE          Schedule acknowledge packet
RDMA READ                   Re-execute request, schedule response
ATOMIC Operation            Do not re-execute request; after validating the
                            request, return the saved results from the
                            referenced ATOMIC Operation request.
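
The ATOMIC row is the one that differs in kind: re-executing a duplicate would modify memory a second time, so the responder replays a saved result instead. A minimal sketch of that idea follows, assuming a fetch-and-add style operation and a result cache keyed by PSN. The class `AtomicResponder` and its method are hypothetical illustrations, and validation of the duplicate request is omitted.

```python
class AtomicResponder:
    """Illustrative responder that saves ATOMIC results for replay."""

    def __init__(self, memory: dict):
        self.memory = memory      # address -> integer value
        self.saved = {}           # PSN -> original value already returned

    def fetch_add(self, psn: int, addr: int, delta: int) -> int:
        # Duplicate request: do not re-execute, return the saved result.
        if psn in self.saved:
            return self.saved[psn]
        original = self.memory[addr]
        self.memory[addr] = original + delta
        self.saved[psn] = original
        return original
```

A duplicate SEND or RDMA WRITE needs no such cache because only an acknowledge is scheduled, and a duplicate RDMA READ can simply be executed again.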






Invalid Request: A packet with an actual received PSN outside the valid
region and not in the expected "regions" is an invalid request. An invalid
PSN value is generally an indication that one or more request packets
have been lost in the fabric.

The responder's detailed behavior in response to an invalid request
packet is as follows:

• The errant request packet is not executed.

• Any request packets received prior to the errant request must be
  executed and completed before the NAK-Sequence Error is issued,
  since it acts as an implicit ACK for prior outstanding SEND
  or RDMA WRITE requests, and as an implicit NAK for outstanding
  RDMA READ or ATOMIC Operation requests.

• A NAK-Sequence Error is returned to the requester.

• The responder does not update its expected PSN.

C9-79: For Reliable Connection service in an HCA responder, when the 
actual PSN of an inbound request message is outside the valid region (In- 
valid Request), a NAK-Sequence Error shall be returned by the re- 
sponder. Any request packets received prior to the errant request must 
be executed and completed before the NAK-Sequence Error is issued. 

o9-45: If Reliable Datagram service is implemented in a CA, or if Reliable 
Connection service is implemented in a TCA, and if the actual PSN of an 
inbound request message is outside the valid region (Invalid Request), 
then a NAK-Sequence Error shall be returned by the responder. Any re- 
quest packets received prior to the errant request must be executed and 
completed before the NAK-Sequence Error is issued. 

The responder resumes waiting for a valid inbound request packet that 
has a PSN equal to its expected PSN or within its valid region. If, while 
waiting for a valid new request, the responder receives any subsequent 
invalid request packets, those packets are simply dropped silently; no 
NAK is returned. 




C9-80: For Reliable Connection service in an HCA responder, after
generating a NAK-Sequence Error, the responder shall not generate an ACK
or NAK until it receives either a valid new request, or a valid duplicate
request.

o9-46: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, then after generating a
NAK-Sequence Error, the responder shall not generate an ACK or NAK
until it receives either a valid new request, or a valid duplicate request.

There is no requirement that the queue be stopped or for a connected
transport service that the connection be torn down.
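
The "one NAK, then silence" behavior can be modeled as a single flag. The sketch below is an illustration only: `SequenceErrorGate` is a hypothetical name, and the caller is assumed to have already classified each inbound request as `expected`, `duplicate` or `invalid`.

```python
class SequenceErrorGate:
    """Tracks whether a NAK-Sequence Error has already been issued."""

    def __init__(self):
        self.nak_outstanding = False

    def on_request(self, kind: str) -> str:
        """Map a classified inbound request to the responder's action."""
        if kind == "invalid":
            if self.nak_outstanding:
                return "drop"               # later invalid packets: silent drop
            self.nak_outstanding = True
            return "nak-sequence-error"     # first invalid packet: one NAK
        if kind not in ("expected", "duplicate"):
            raise ValueError(f"unknown classification: {kind}")
        # A valid new or duplicate request re-enables ACK/NAK generation.
        self.nak_outstanding = False
        return "process"
```

Nothing in the gate stops the queue or tears down the connection. It only suppresses further responses until a valid request arrives.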

9.7.4.1.3 Responder - OpCode Sequence Check

A request packet must fit within a schedule of valid OpCode sequences.

For Reliable Connected and Reliable Datagram services the responder
shall check the sequence of packet OpCodes comprising the request
message as follows:

1) If this is the first packet following establishment of the connection,
   then the packet OpCode must indicate either "first" or "only".

2) If the last valid packet received had an OpCode indicating "first", then
   the current OpCode must indicate either "middle" or "last". It must
   also match the operation type specified in the last valid packet (Send,
   RDMA, ATOMIC Operation). It is an error if the current OpCode
   indicates "first" or "only", since that implies that the last packet of the
   previous message was missed.

3) If the last valid packet received had an OpCode indicating "middle",
   then the current OpCode must indicate either "middle" or "last". It
   must also match the operation type specified in the last valid packet
   (Send, RDMA, ATOMIC Operation). It is an error if the current
   OpCode indicates "first" or "only" packet since that implies that the
   last packet of the previous message was missed.

4) If the last valid packet received had an OpCode indicating "last", then
   the current OpCode must indicate either "first" or "only". It is an error
   if the current OpCode indicates either "middle" or "last", since that
   implies that the first packet of the message was missed.

5) If the last valid packet received had an OpCode indicating "only", then
   the current OpCode must indicate either "first" or "only". It is an error
   if the current OpCode indicates either a middle packet or last packet
   since that implies that the first packet of the message was missed.
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These rules are stated succinctly in the following table. 

Table 42 Schedule of Valid OpCode Sequences

Previous Packet OpCode       Valid OpCodes for Current Packet
---------------------------  -------------------------------------------------
None (e.g., first packet     "First" packet
following connection         "Only" packet
establishment)

"First" packet               "Middle" packet (message is 3 or more packets)
                             "Last" packet (message is exactly 2 packets)
                             Type of operation must match the previous OpCode

"Middle" packet              "Middle" packet
                             "Last" packet
                             Type of operation must match the previous OpCode

"Last" packet                "First" packet (1st packet of a new message)
                             "Only" packet (1st packet of a new single packet
                             msg)

"Only" packet                "First" packet
                             "Only" packet



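
The schedule of valid OpCode sequences maps naturally onto a lookup table. The Python sketch below is one possible encoding, not a normative one: packets are represented as hypothetical `(position, operation)` tuples, and `VALID_NEXT` and `opcode_sequence_ok` are illustrative names.

```python
# Previous packet position -> positions allowed for the current packet.
# None means no packet has been received since connection establishment.
VALID_NEXT = {
    None: {"first", "only"},
    "first": {"middle", "last"},
    "middle": {"middle", "last"},
    "last": {"first", "only"},
    "only": {"first", "only"},
}


def opcode_sequence_ok(prev, current) -> bool:
    """Check the current packet against the last valid packet.

    `prev` is None or a (position, operation) tuple; `current` is a
    (position, operation) tuple, e.g. ("middle", "SEND").
    """
    prev_pos = prev[0] if prev is not None else None
    pos, operation = current
    if pos not in VALID_NEXT[prev_pos]:
        return False
    # Inside a multi-packet message the operation type must not change.
    if prev_pos in ("first", "middle") and operation != prev[1]:
        return False
    return True
```

For example, `opcode_sequence_ok(("first", "SEND"), ("last", "RDMA WRITE"))` is rejected because the operation type changed in mid-message.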



C9-81: For an HCA responder using Reliable Connected service, the
responder shall check that the sequence of packet OpCodes comprising the
request message conforms to the schedule shown in Table 42 Schedule
of Valid OpCode Sequences on page 257. If the responder detects an
invalid opcode sequence, it shall return a NAK-Invalid Request to the
requester.

o9-47: If a TCA responder implements Reliable Connected service, the
responder shall check that the sequence of packet OpCodes comprising
the request message conforms to the schedule shown in Table 42
Schedule of Valid OpCode Sequences on page 257. If the responder
detects an invalid opcode sequence, it shall return a NAK-Invalid Request to
the requester.

o9-48: If a CA responder implements Reliable Datagram service, the
responder shall check that the sequence of packet OpCodes comprising the
request message conforms to the schedule shown in Table 42 Schedule
of Valid OpCode Sequences on page 257.

The detailed behavior in the presence of an invalid OpCode sequence is
specified in Section 9.9 Error detection and handling on page 336.

For Reliable Datagram service, an invalid OpCode sequence does not
necessarily imply an error in the current request packet if the PSN
sequence is valid. It does, however, imply an error with the previous request
message.

o9-49: For Reliable Datagram service, if a "first" or "only" OpCode is
received when a "middle" or "last" OpCode was expected, and if the PSN
indicates a new valid request, then the responder shall treat the packet as
the first packet of a new request message. The responder shall complete
the previous message in error and begin receiving the new message.

The responder's behavior in the presence of an invalid OpCode sequence
is detailed in Section 9.9 Error detection and handling on page 336.

9.7.4.1.4 Responder OpCode Validation

C9-82: Before executing an inbound request, the responder shall validate
the OpCode field of the BTH.



The OpCode is checked for the following characteristics:

• The requested function (Send, RDMA, ATOMIC) is supported by this
  receive queue.

• If the request is for an RDMA READ or an ATOMIC Operation, there
  are sufficient resources available to receive it.

As the packet was passed up to the transport layer, BTH OpCode
field[7:5] was checked to ensure that the requested operation was for a
reliable service. If it was not, then the packet was silently dropped. This
check is specified in Section 9.6 Packet Transport Header Validation on
page 228. Thus, before the packet arrives at the queue pair for validation
according to the rules in this section, it is already known to be a request
for a reliable service.



C9-83: For Reliable Connection service in an HCA responder, if the
request is for a function which this receive queue does not support, then a
NAK-Invalid Request shall be returned.

For example, if the queue pair is not configured to accept requests for
RDMAs, but the request is for an RDMA WRITE, then a NAK-Invalid
Request shall be returned.

o9-50: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, if the request is for a
function which this receive queue does not support, then a NAK-Invalid
Request shall be returned.

C9-84: For Reliable Connection service in an HCA responder, if the
BTH OpCode field[4:0] specifies a Reliable Connection reserved opcode
or a Reliable Datagram reserved opcode, a NAK-Invalid Request shall be
returned.

o9-51: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, and the BTH OpCode
field[4:0] specifies a Reliable Connection reserved opcode or a Reliable
Datagram reserved opcode, then a NAK-Invalid Request shall be
returned.

C9-85: For Reliable Connection service in an HCA responder, if BTH
OpCode field[4:0] specifies a first or middle request packet (e.g. SEND First,
or RDMA WRITE Middle), then the pad count bits shall be verified to be
b00, indicating no pad bytes are present. If the pad count bits are
non-zero, a NAK-Invalid Request shall be returned.

o9-52: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, if BTH OpCode field[4:0]
specifies a first or middle request packet (e.g. SEND First, or RDMA
WRITE Middle), then the pad count bits shall be verified to be b00,
indicating no pad bytes are present. If the pad count bits are non-zero, a
NAK-Invalid Request shall be returned.

C9-86: For Reliable Connection service in an HCA responder, if there are
insufficient resources to receive a new RDMA READ or ATOMIC
Operation request, then a NAK-Invalid Request shall be returned.

o9-53: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, and if there are insufficient
resources to receive a new RDMA READ or ATOMIC Operation request,
then a NAK-Invalid Request shall be returned.

22 

The behavior for returning a NAK-Invalid Request is as follows:

• Any request packets received prior to the errant request must be
  executed and completed before the NAK-Invalid Request is issued.
  This is important since the NAK effectively coalesces responses
  to earlier outstanding requests and acts as an implicit
  response for prior outstanding SENDs, RDMA WRITEs, ATOMIC
  Operations or RDMA READ requests. See Section 9.7.5.1.2
  Coalesced Acknowledge Messages on page 263 for details.

• NAK-Invalid Request is returned.

• The responder does not update its expected PSN.

C9-87: For Reliable Connection service in an HCA responder, any
request packets received prior to a packet containing an invalid opcode
must be executed and completed before a NAK-Invalid Request is issued
by the responder.

o9-54: If Reliable Datagram service is implemented in a CA, or if Reliable
Connection service is implemented in a TCA, then any request packets
received prior to a packet containing an invalid opcode must be executed
and completed before a NAK-Invalid Request is issued by the responder.

More detail on error behavior in the presence of an invalid request is given
in Section 9.9.3 Responder Side Behavior on page 349.

An R_Key violation is caused by any or all of the following conditions for
either an RDMA READ, RDMA WRITE, or ATOMIC Operation:

• The R_Key field of the RETH is invalid (for RDMA READ or
  WRITES).

• The R_Key field of the AtomicETH is invalid (for ATOMIC
  Operations).

• The virtual address and length, or type of access specified, is
  outside the locally defined limits associated with the R_Key. For an
  RDMA WRITE request, the length check is conducted on a per
  packet basis, and is based on the LRH:PcktLength field. For an
  RDMA READ request, the length check is based on the
  RETH:DMA Length field.

C9-88: For an HCA responder using Reliable Connection service, for
each zero-length RDMA READ or WRITE request, the R_Key shall not be
validated, even if the request includes Immediate data.

o9-55: If an HCA responder implements Reliable Datagram service, or if
a TCA responder implements Reliable Connection and RDMA
functionality, or if a TCA responder implements Reliable Datagram service and
RDMA functionality, the responder shall behave as follows. For each
zero-length RDMA READ or WRITE request, the R_Key shall not be
validated, even if the request includes Immediate data.

C9-89: If the responder's receive queue detects an R_Key violation, a
NAK-Remote Access Error shall be returned to the requester using the
PSN of the errant request packet.

C9-90: Any request packets received prior to a packet containing an
R_Key violation shall be executed and completed before a NAK-Remote
Access Error is issued by the responder.

9.7.4.1.6 Responder - Length Validation (1)

C9-91: The PktLen field of the LRH shall be checked to confirm that there
is sufficient space available in the receive buffer specified by the receive
WQE. If the buffer defined by the receive WQE is insufficient to hold an
inbound SEND request then a NAK-Invalid Request shall be returned.

(1) CAs are not required to validate the GRH packet length.

C9-92: The length of the packet shall also be validated by comparing it to
the OpCode as follows:

If the OpCode specifies a "first" or "middle" packet, then the packet
payload length must be a full PMTU size.

If the OpCode specifies an "only" packet, then the packet payload length
must be between zero and PMTU bytes in size. Thus, the only way to
create a zero byte length transfer is by use of a single packet message.

If the OpCode specifies a "last" packet, then the packet payload length
must be between one and PMTU bytes in size.
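
The three length rules reduce to a small predicate. The following Python sketch is illustrative only: `payload_length_ok` is a hypothetical helper, the packet position is passed as a string rather than decoded from the BTH OpCode, and the PMTU value used in any call is whatever the path has negotiated.

```python
def payload_length_ok(position: str, length: int, pmtu: int) -> bool:
    """Check a request packet's payload length against its OpCode position."""
    if position in ("first", "middle"):
        return length == pmtu            # must be a full PMTU
    if position == "only":
        return 0 <= length <= pmtu       # zero-length transfers allowed
    if position == "last":
        return 1 <= length <= pmtu       # a last packet carries data
    raise ValueError(f"unknown packet position: {position}")
```

With a PMTU of 1024 bytes, `payload_length_ok("last", 0, 1024)` is false, which reflects the statement that a zero byte transfer is only possible as a single packet message.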



C9-93: If a packet is detected with an invalid length the request shall be
an invalid request.

The responder's behavior in such a case is specified in Section 9.9.3
Responder Side Behavior on page 349.

In addition to checking the LRH:PktLen field, the DMA Length field of the
RETH is checked as follows.

For an RDMA WRITE request, the responder may optionally check the
DMA Length field in the RETH to ensure that it does not specify a transfer
length of greater than 2^31 bytes. It may also, at the end of the transfer,
verify that the sum of the packet payloads equalled the DMALen field in
the RETH. If the responder detects either of these conditions, it may treat
the request as an invalid request.

C9-94: For an HCA validating an inbound RDMA READ request, the DMA
Length field shall be checked. If the request is for greater than 2^31 bytes,
then a NAK-Invalid Request shall be returned.

o9-56: If a TCA implements RDMA operations, then for an inbound RDMA
READ request, the DMA Length field shall be checked. If the request is
for greater than 2^31 bytes, then a NAK-Invalid Request shall be returned.



9.7.4.1.7 Responder Local Operation Validation

A valid inbound request may still fail to complete due to a failure that is
local to the responder, e.g. local memory translation error while accessing
local memory. All local responder errors are reported to the requester as
NAK-Remote Operational Error. See 9.7.5.2.6 Remote Operational Error
on page 280 for additional details.

9.7.5 Responder: Generating Responses

9.7.5.1 Responder Side Behavior

This section specifies the required behavior that a responder must follow
when generating acknowledge messages.

9.7.5.1.1 Generating PSNs for Acknowledge Messages 

As the responder generates a response to each request, it shall identify 
the request with which the response is associated by inserting a PSN in 
the BTH of the response. 

This allows the requester to correlate response packets it receives with its 
request. This basic concept is illustrated below in Figure 88 Example: 
PSNs for Response Messages on page 262 . 



[Figure: the requester sends requests with PSN=1, PSN=2 and PSN=3 to the
responder; the responder returns a response with PSN=1 and a response
with PSN=3. 'r' is a request packet; 'a' is an acknowledge packet
(message).]

Figure 88 Example: PSNs for Response Messages

C9-95: For responses to SEND requests or RDMA WRITE requests the 
responder shall insert in the PSN field of the response the PSN of the 
most recent request packet being acknowledged. 

Because of the rules for coalescing acknowledges (given in Section
9.7.5.1.2), the PSNs for consecutive response packets may not
necessarily be sequential.

C9-96: For HCA responses to RDMA READ requests, the PSNs of the
response packets must be sequential and monotonically increasing
beginning with the PSN of the original RDMA READ request message.

o9-57: If a TCA implements RDMA READ functionality, then for each
RDMA READ response the PSNs of the response packets must be
sequential and monotonically increasing beginning with the PSN of the
original RDMA READ request message.

o9-58: Since ATOMIC Operation requests require an explicit response,
and since an ATOMIC Operation request is restricted to a single packet,
the PSN of the response packet must be identical to the PSN of the
request.
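
As a worked illustration of the RDMA READ rule, the response PSNs can be enumerated from the request PSN, the requested length and the PMTU. This sketch assumes that "sequential" is taken modulo 2^24 (the PSN range stated earlier is 0 to 16,777,215) and that a zero-length read still produces one response packet. `read_response_psns` is a hypothetical helper name.

```python
PSN_MOD = 1 << 24  # 24-bit PSN space


def read_response_psns(request_psn: int, dma_length: int, pmtu: int) -> list:
    """List the PSNs carried by the packets of one RDMA READ response."""
    if dma_length < 0 or pmtu <= 0:
        raise ValueError("length must be >= 0 and PMTU must be > 0")
    # Ceiling division, with a minimum of one packet for a zero-length read.
    packets = max(1, -(-dma_length // pmtu))
    return [(request_psn + i) % PSN_MOD for i in range(packets)]
```

A 2500-byte read with a 1024-byte PMTU and request PSN 10 therefore uses PSNs 10, 11 and 12, and the next request from that requester would carry PSN 13.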

9.7.5.1.2 Coalesced Acknowledge Messages

It is not required that there be a unique, discrete response for each
request packet. Instead, the responder may acknowledge several
outstanding request packets with a single acknowledge packet. This is called
acknowledge coalescing.

A given response packet acknowledges prior outstanding requests (i.e.,
those with earlier PSNs than the PSN contained in the BTH of the
response packet) as follows:

1) If there is an outstanding RDMA READ or ATOMIC Operation request
   with a PSN earlier than the PSN in the BTH of the response packet,
   then the response packet implies a negative acknowledgment for the
   oldest such outstanding RDMA READ or ATOMIC Operation request.
   Any requests posted to the send queue subsequent to such an
   RDMA READ or ATOMIC Operation request are not acknowledged.
   This is illustrated in Figure 89 Requester Interpretation of Coalesced
   Acknowledges on page 264.

2) It implies a positive acknowledgment for all outstanding SEND or
   RDMA WRITE request packets with a PSN earlier than the PSN in
   the BTH of the response packet, unless such an outstanding SEND
   or RDMA WRITE request falls after an outstanding RDMA READ or
   ATOMIC Operation request as described above.

3) If the given response is an RDMA READ response message, it is the
   first (or only) packet of an RDMA READ response message that
   implicitly acknowledges prior outstanding requests.

4) The last (or only) packet of an RDMA READ response message
   explicitly acknowledges only the RDMA READ request.

These rules are illustrated in Figure 89.
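
Rules 1 and 2 can be expressed as a single pass over the requester's outstanding requests. The sketch below is a simplified illustration: it assumes the response is an ordinary acknowledge packet (not an RDMA READ response), ignores PSN wraparound, and represents each outstanding request as a hypothetical `(psn, kind)` tuple in send order. The function name `interpret_acknowledge` is likewise illustrative.

```python
def interpret_acknowledge(outstanding, response_psn: int) -> dict:
    """Map each outstanding request PSN to 'ack', 'nak' or 'unacknowledged'.

    `outstanding` is a list of (psn, kind) in send-queue order, where kind
    is one of "SEND", "RDMA WRITE", "RDMA READ" or "ATOMIC".
    """
    result = {}
    blocked = False  # set once an earlier READ/ATOMIC has been implicitly NAKed
    for psn, kind in outstanding:
        if blocked or psn > response_psn:
            result[psn] = "unacknowledged"
        elif kind in ("RDMA READ", "ATOMIC") and psn < response_psn:
            # Oldest READ/ATOMIC before the response PSN: implicit NAK.
            result[psn] = "nak"
            blocked = True
        else:
            result[psn] = "ack"
    return result
```

Fed with SEND r1, SEND r2, RDMA READ r3 and SEND r4 and an acknowledge carrying the PSN of r4, it reports r1 and r2 as acknowledged, r3 as NAKed and r4 as not acknowledged.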

[Figure: the requester issues SEND request r1, SEND request r2, RDMA READ
request r3 and SEND request r4 to the responder. The RDMA READ response
is lost.]

Acknowledge packet a4:

1) implicitly acknowledges SEND requests r1 and r2,

2) implicitly NAKs RDMA READ request r3,

3) does not acknowledge SEND request r4.

Thus, acknowledges for requests r1 and r2 have been coalesced.

Figure 89 Requester Interpretation of Coalesced Acknowledges

9.7.5.1.3 Acknowledging RDMA READ Requests

An RDMA READ response is different from a normal response in that it
contains a data payload.

Every RDMA READ request message requires a discrete acknowledgment,
called the RDMA READ response, which consists of one or more
packets.

C9-97: For an HCA, if an RDMA READ response contains more than one
packet, the first and last packets must contain an AETH. Both AETHs shall
contain a valid Message Sequence Number (MSN).

o9-59: In a TCA implementing RDMA, if an RDMA READ response
contains more than one packet, the first and last packets must contain an
AETH. Both AETHs shall contain a valid Message Sequence Number
(MSN).

The AETH in the first packet implicitly acknowledges prior outstanding
requests as specified in Section 9.7.5.1.2 Coalesced Acknowledge
Messages on page 263. The AETH in the last packet acknowledges the
RDMA READ request.

C9-98: For an HCA, if an RDMA READ response is itself a single packet,
then that packet must contain an AETH.

o9-60: If a TCA implements RDMA functionality, and an RDMA READ
response is itself a single packet, then that packet must contain an AETH.

The AETH contained in a single packet RDMA READ response serves to
both implicitly acknowledge prior outstanding requests as well as to
explicitly acknowledge the RDMA READ request itself.

C9-99: An HCA responder shall generate RDMA READ response packet
payload lengths which are consistent with the opcode as follows:

1) A packet with an opcode of "RDMA READ response only" shall be
   zero to (PMTU) bytes long.

2) A packet with an opcode of "RDMA READ response middle" shall be
   exactly (PMTU) bytes long.

3) A packet with an opcode of "RDMA READ response last" shall be
   one to (PMTU) bytes long.

4) Zero length RDMA READ requests are permitted.

5) A response to a zero length RDMA READ request shall consist of a
   single packet with an opcode of "RDMA READ response only".
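
One way to see how these length rules fit together is to split a read of a given length into response packets. The sketch below is illustrative and makes one assumption beyond the list above: a "first" response packet is taken to carry a full PMTU, like a "middle" packet, since the list as reproduced here does not state a length for it. `read_response_packets` is a hypothetical name.

```python
def read_response_packets(length: int, pmtu: int) -> list:
    """Split an RDMA READ response into (position, payload_length) pairs."""
    if length < 0 or pmtu <= 0:
        raise ValueError("length must be >= 0 and PMTU must be > 0")
    if length <= pmtu:
        # Covers the zero-length case: a single "only" packet.
        return [("only", length)]
    packets = []
    remaining = length
    position = "first"   # ASSUMPTION: first packets carry a full PMTU
    while remaining > pmtu:
        packets.append((position, pmtu))
        remaining -= pmtu
        position = "middle"
    packets.append(("last", remaining))   # always 1..PMTU bytes here
    return packets
```

Because the loop stops while more than zero bytes remain, the "last" packet can never be empty, which matches item 3.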

o9-61: If a TCA implements RDMA functionality, it shall generate
response packets with payload lengths as described in the previous
compliance statement.

C9-100: If an HCA responder detects an error while in the process of
returning a multi-packet RDMA READ response, it shall force a premature
termination of the RDMA READ response by not transmitting any of the
errant payload data and forcing the opcode of the packet on which the
error occurred to "acknowledge" instead of an opcode of "RDMA READ
response last". The appropriate NAK code is inserted.

o9-62: If a TCA implements RDMA functionality, and detects an error
while in the process of returning a multi-packet RDMA READ response, it
shall force a premature termination of the RDMA READ response by not
transmitting any of the errant payload data and forcing the opcode of the
packet on which the error occurred to "acknowledge" instead of an opcode
of "RDMA READ response last". The appropriate NAK code is inserted.
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Due to the relaxed ordering rules for RDMA READ Requests, the re- 
sponder is permitted to begin executing one or more SEND or RDMA 
WRITE requests that arrive after the RDMA READ request. 

C9-101 : For an HCA, before executing any of the requests following the 
RDMA READ request, the header fields of the RDMA READ request must 
be validated. These requests must not be acknowledged until the out- 
standing RDMA READ responses have been sent. 

o9-63: Before executing any of the requests following the RDMA READ 
request, the header fields of the RDMA READ request must be validated. 
These requests must not be acknowledged until the outstanding RDMA 
READ responses have been sent. 



[Figure 89: sequence diagram between the Requester and the Responder.

 Requester's send queue:
   r1: RDMA READ request
   r4: RDMA READ request
   r5: RDMA WRITE request

 Legend: 'r' is a request packet, 'a' is an acknowledge message,
 'n' is a negative acknowledge message.

 Responder annotations, in order:
   - Responder begins executing r1. r1 will require 3 response packets.
   - Responder begins executing r4.
   - While executing r1 and r4, responder may begin executing r5.
   - No response has been returned for r4 or r5 yet, because r1 has not
     yet completed.
   - While returning responses to r1, responder detects R_Key violation
     on r4.
   - Responder NAKs r4 after ACK'ing r1.
   - Responder must not acknowledge r5.]

Figure 89 Relaxed Ordering Rules for RDMA READs

9.7.5.1.4 Acknowledging Duplicate Requests

After validating a duplicate request, the responder's behavior in
response to the duplicate request packet is as follows:

C9-102: If the packet is valid, the responder shall generate a response.

C9-103: Throughout the processing of the duplicate request, the
responder shall not update its expected PSN; it remains set to the value
it had prior to the arrival of the duplicate request. This is true even if
the responder detects an error while in the process of generating the
response to the duplicate request.

Following generation of the appropriate response (as described in the
next paragraphs), the responder resumes waiting for a new inbound
packet with a PSN matching its expected PSN.

It is possible that the responder will receive another duplicate request
while waiting for a new inbound packet. This is perfectly valid, and should
be treated as simply another duplicate request. Furthermore, since it is a
duplicate request, there is no requirement that the next request received
be in sequential PSN order with the first duplicate request. However, the
responder is required to maintain the same ordering rules for generating
responses to duplicate requests as are required for normal new requests.

C9-104: In particular, a duplicate RDMA READ or ATOMIC Operation
request must be acknowledged with an explicit response prior to returning
acknowledges for subsequent duplicate SEND or RDMA WRITE requests.

This is illustrated in Figure 90 Maintaining the Order of Responses to
Duplicate Requests on page 268.

[Figure 90: sequence diagram between the Requester and the Responder.

 Requester sends, in order:
   - RDMA READ request: PSN=1
   - Send request: PSN=2
   - duplicate RDMA request: PSN=1
   - duplicate Send request: PSN=2

 Responder returns, in order:
   - RDMA READ response for r1
   - response for r2
   - resend RDMA READ response for r1
   - resend response for r2

 Legend: 'r' is a request packet, 'a' is an acknowledge packet (message).

 Note: Responder must return response to duplicate RDMA READ request r1
 before it can return response to duplicate SEND request r2.]

Figure 90 Maintaining the Order of Responses to Duplicate Requests 

The response to be generated is a function of the duplicate request mes- 
sage as follows: 

• SEND or RDMA WRITE Request 

C9-105: For an HCA, or for a TCA with an inbound SEND request, the
responder shall not re-execute the request but only generate a response
packet for the duplicate packet, pending responses for any outstanding
duplicate RDMA READ requests or ATOMIC Operation requests.

o9-64: If a TCA responder implements RDMA functionality, it shall not
re-execute the RDMA WRITE request but only generate a response packet
for the duplicate packet, pending responses for any outstanding duplicate
RDMA READ requests or ATOMIC Operation requests.

The PSN of the acknowledge message may be either the same as the
PSN of the duplicate request or it may be the PSN of the most recently
completed request. One way to think of this process is as a logical
extension of the ability to coalesce acknowledges. Indeed, the requester,
on receiving a response to a duplicate request, treats it exactly as it
would any other coalesced acknowledge; any outstanding duplicate
RDMA READ or ATOMIC Operation requests are considered to be
NAK'ed. In this case, by returning the PSN of the most recently
completed request, the responder is informing the requester that it
believes it has already seen and executed all requests between the
duplicate request and the most recently completed request. This is
illustrated in Figure 91.

C9-106: For duplicate SEND or RDMA WRITE requests, if the responder
detects an error while in the process of returning the response, it shall
silently drop the duplicate request. This is done in order to avoid
confusion with any possible outstanding NAKs to new requests.

[Figure 91: sequence diagram between the Requester and the Responder.

 Requester sends, in order:
   - SEND request: PSN=1
   - SEND request: PSN=2
   - SEND request: PSN=3
   - duplicate SEND request: PSN=1

 Responder returns, in order:
   - response PSN=1
   - response PSN=3
   - resent response a3

 Legend: 'r' is a request packet, 'a' is an acknowledge packet (message).]

Figure 91 Acknowledging a Duplicate SEND Request 



• RDMA READ Request 

C9-107: An HCA responder must re-create the requested read response 
data. The resulting read data is returned to the requester in an RDMA 
READ response. The PSN of the first RDMA READ response packet shall 
be the same as the PSN of the duplicate request, with the PSNs for the 
subsequent response packets incrementing according to the normal rules 
for generating PSNs for RDMA READs. 

C9-108: If an HCA responder detects an error while re-executing a dupli- 
cate RDMA READ request before returning the first response packet, the 
responder shall silently drop the duplicate request. 

C9-109: If an HCA responder detects an error while re-executing a
duplicate RDMA READ Request after returning one or more response
packets, the RDMA READ response operation shall be aborted, i.e. no
more response packets shall be returned.

C9-110: The HCA responder shall complete execution of an outstanding
duplicate RDMA READ request or ATOMIC Operation request before
responding to a subsequent duplicate SEND or RDMA WRITE request. In
other words, duplicate RDMA READ requests or ATOMIC Operation
requests shall be executed in the order in which the duplicate request is
received.

o9-65: If a TCA implements RDMA functionality, RDMA READ
Responses shall conform to the previous 4 compliance statements for HCAs.

Following the duplicate RDMA READ response, the responder may
acknowledge any subsequent duplicate Send or RDMA WRITE
requests with the PSN of the most recently completed request. This is
illustrated in Figure 92.

[Figure 92: sequence diagram between the Requester and the Responder.

 Requester sends, in order:
   - RDMA READ request: PSN=1
   - Send request: PSN=2
   - Send request: PSN=3
   - duplicate RDMA RD req: PSN=1
   - duplicate Send request: PSN=2

 Responder returns, in order:
   - RDMA READ response for r1
   - Coalesced response for r2 and r3
   - resend RDMA READ response for r1
   - resend response for r3. This implicitly acknowledges the duplicate r2

 Legend: 'r' is a request packet, 'a' is an acknowledge packet (message).]

Figure 92 Acknowledging a Duplicate RDMA READ Request



• ATOMIC Operation Request 

A given receive queue may have resources to support only a limited 
number of ATOMIC Operations. When a duplicate ATOMIC Operation 
request is received, the PSN of the duplicate request is compared to 
the PSNs of the recently executed ATOMIC Operations. 
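
As a rough illustration of the PSN comparison described above, the following Python sketch keeps the saved results of recently executed ATOMIC Operations keyed by PSN. The class name `AtomicReplayCache`, the `capacity` parameter, and the oldest-first eviction policy are assumptions made for this example; the specification only says that the receive queue has limited resources. A lookup that finds no exact PSN match returns `None`, which a caller could treat as "drop the duplicate silently", in line with the compliance statements that follow.

```python
from collections import OrderedDict


class AtomicReplayCache:
    """Saved results of the most recently executed ATOMIC Operations.

    Hypothetical helper: names and eviction policy are illustrative only.
    """

    def __init__(self, capacity):
        self.capacity = capacity          # assumed resource limit
        self._results = OrderedDict()     # PSN -> saved result

    def record(self, psn, result):
        """Save the result of a newly executed ATOMIC Operation."""
        self._results[psn] = result
        self._results.move_to_end(psn)
        while len(self._results) > self.capacity:
            self._results.popitem(last=False)   # evict the oldest entry

    def replay(self, psn):
        """Return the saved result for a duplicate request, or None.

        The operation is never re-executed here; None means that no
        recently executed operation has exactly this PSN.
        """
        return self._results.get(psn)
```

For example, with `capacity=2`, recording PSNs 10, 11 and 12 leaves only 11 and 12 replayable.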

o9-66: If the PSN of the duplicate ATOMIC Operation request matches
exactly the PSN of one of the recently executed ATOMIC Operations, the
saved results of that operation shall be returned to the requester. The
responder shall not re-execute the request.

o9-67: If the PSN of the duplicate ATOMIC Operation request does not
match the PSN of one of the recently executed ATOMIC Operations, the
request is invalid and the duplicate request packet shall be silently
dropped. This should never happen as long as the requester is observing
the limits on the number of outstanding ATOMIC Operation requests.

o9-68: If a local error prevents the responder from reproducing the
original ATOMIC Operation request data, the responder must silently drop
the duplicate request.

In all cases, the PSN returned in the acknowledge message is the
PSN of the duplicate request.

9.7.5.1.5 Generating NAKs

There are several circumstances that cause a responder to generate a
NAK.

C9-111: In all cases except for RDMA READ requests, the PSN of the
NAK packet shall contain the responder's expected PSN.

C9-112: In the case of an RDMA READ response packet, the PSN given
in the NAK response packet shall point to the RDMA READ response
packet which is being NAK'ed.

C9-113: When generating an RNR NAK, the PSN of the response packet
shall point to the PSN of the packet being RNR NAK'ed.

Once the responder has returned a NAK-sequence error or an RNR NAK,
it waits for the requester to send a packet with the responder's expected
PSN.

C9-114: Once a NAK has been returned for a PSN sequence error, the
responder shall ignore all other new requests, except duplicate requests,
until it receives a valid request with a PSN that matches its expected PSN.
It shall not return any other NAK packets, except in response to a valid
request with a PSN that matches its expected PSN.

C9-115: The responder must continue to respond to duplicate requests as
specified above. However, the responder shall not return a NAK in
response to an error condition occurring while processing a duplicate
request.

9.7.5.1.6 Acknowledge Message Scheduling

The scheduling of responses, per se, is not specified; however the
requester may use the AckReq bit in the BTH to require the responder to
schedule a response.



The rules that the responder must follow are as follows:

C9-116: When the responder receives a valid request packet with the
AckReq bit set, it shall schedule a response packet for that request.

There are several places where the AckReq bit can be very useful. For
example, if the requester is sending the last packet of the last request WQE
posted to the send queue, it is advisable for the requester to set the
AckReq bit or use some other mechanism to force the responder to return
an explicit response. If the requester does not do so, there is a possibility
that the responder will choose to coalesce responses and thus not return
an explicit response for that packet. Some other mechanisms that the
requester can use to ensure that the responder returns an explicit response
are to always set the AckReq bit on the last (or only) packet of every
message, or to follow a given request with a NOP, or to retry the request
for which an explicit response was desired.

For SEND or RDMA WRITE requests, an ACK may be scheduled before
data is actually written into the responder's memory. The ACK simply
indicates that the data has successfully reached the fault domain of the
responding node. That is, the data has been received by the channel
adapter and the channel adapter will write that data to the memory system
of the responding node, or the responding application will at least be
informed of the failure.

The absence of the AckReq bit does not prohibit the responder from
generating a response packet. As always, RDMA READ and ATOMIC
Operation requests require explicit responses, thus the AckReq bit has no
effect on such requests.

9.7.5.1.7 Response Formats

Responses may take one of three forms:

1) An acknowledge packet for normal SEND or RDMA WRITE
operations,

2) RDMA READ responses, and

3) Acknowledge messages for ATOMIC Operations - see 9.4.4 ATOMIC
Operations on page 218.

The key distinction between the three forms is that the normal
acknowledge packet (used for SENDs and RDMA WRITEs) does not carry a
payload field, while the responses for both the RDMA READ and ATOMIC
Operations do. This observation impacts both the format of the response
and the rules for coalescing acknowledges.

An acknowledge packet contains the following information:

• A syndrome used to notify the requester of the success or failure
of a given request message,

• A PSN value used by the requester to correlate the acknowledge
message with its listing of outstanding requests,

• A Message Sequence Number used by the responder to notify
the requester that request messages have been completed,

• Optional End-to-End flow control credits,

• Payload data in the case of a RDMA READ response or ATOMIC
Operation response.

Each of the three forms is discussed in the following sections.

9.7.5.1.8 Response Format for SEND or RDMA WRITE Requests

This format is used to acknowledge SEND request packets or RDMA
WRITE request packets. A normal acknowledge message comprises a
single packet, and is shown in Figure 93.

note 1: GRH may or may not appear, depending on the LRH Next
Header field
note 2: RDETH appears only for reliable datagram operations
note 3: DETH, RETH, EOP, PYLD and IMM fields are prohibited

Figure 93 Response Format for SENDs, RDMA WRITEs

9.7.5.1.9 RDMA READ RESPONSES

This response format, called a RDMA READ response, is used to
acknowledge RDMA READ requests. A RDMA READ response message
consists of one or more packets.

C9-117: For an HCA, the PSNs of the RDMA READ response packets
must be sequential and monotonically increasing. If the response
message consists of more than one packet, the first and last packets of the
response message must contain an Acknowledge Extended Transport
Header (AETH).

o9-69: If a TCA implements RDMA functionality, the PSNs of the RDMA
READ response packets must be sequential and monotonically
increasing. If the response message consists of more than one packet, the
first and last packets of the response message must contain an
Acknowledge Extended Transport Header (AETH).

C9-118: For an HCA, if the response message contains only a single
packet (an "only" packet), then that packet must contain an AETH. This is
illustrated in Figure 94.

o9-70: If a TCA implements RDMA functionality, and the response
message contains only a single packet (an "only" packet), then that packet
must contain an AETH as shown in Figure 94.



[Figure 94: each RDMA READ Response message comprises one or more packets.

  Packet 1: opcode = "first" or "only"
  Packet 2: opcode = "middle"
  Packet n: opcode = "last"

  Format for all middle packets. PYLD must be (PMTU) bytes:
    LRH | GRH | BTH | RDETH | PYLD | ICRC | VCRC

  Format for first, last or only RDMA READ Response Packet:
    LRH | GRH | BTH | RDETH | AETH | PYLD | ICRC | VCRC
    If a first packet, PYLD shall be (PMTU) bytes long.
    If an only packet, PYLD shall be zero to (PMTU) bytes long.
    If a last packet, PYLD shall be one to (PMTU) bytes long.

  note 1: GRH may or may not appear, depending on the LRH:Next Header field
  note 2: RDETH appears only for reliable datagram operations
  note 3: DETH, RETH, and IMM headers are prohibited]

Figure 94 Acknowledge Message Format for RDMA READ Requests 
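
The packet layout in Figure 94 can be sketched as a small Python function that, given a response length and the path MTU, lists the opcode position, payload length, and AETH presence of each response packet. The function name `read_response_packets` and the tuple layout are assumptions for illustration; the sketch follows only the payload-length and AETH rules stated with the figure and ignores any other constraints on how a responder splits the data.

```python
def read_response_packets(length, pmtu):
    """Return a list of (position, payload_len, has_aeth) tuples.

    position is "only", "first", "middle" or "last".
    Hypothetical helper based on the rules listed with Figure 94.
    """
    if length < 0 or pmtu <= 0:
        raise ValueError("length must be >= 0 and pmtu must be > 0")
    if length <= pmtu:
        # An "only" packet carries zero to PMTU bytes and has an AETH.
        return [("only", length, True)]
    packets = [("first", pmtu, True)]        # first packet: PMTU bytes, AETH
    remaining = length - pmtu
    while remaining > pmtu:
        packets.append(("middle", pmtu, False))   # middle: PMTU bytes, no AETH
        remaining -= pmtu
    packets.append(("last", remaining, True))     # last: 1..PMTU bytes, AETH
    return packets
```

For instance, a 600-byte response with a 256-byte PMTU yields a first packet of 256 bytes, one middle packet of 256 bytes, and a last packet of 88 bytes.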

A RDMA READ Response message, besides acknowledging the RDMA
READ request itself, also implicitly acknowledges requests preceding the
RDMA READ request. The rules governing coalesced ACKs are given in
Section 9.7.5.1.2 Coalesced Acknowledge Messages on page 263.

The arrival of either a first packet or an only packet triggers the implicit
acknowledges of any outstanding request messages as specified in section
9.7.5.1.2 Coalesced Acknowledge Messages on page 263. This is done
in order to reduce the latency to complete any outstanding request
messages.

The arrival of a last packet or an only packet triggers the explicit
acknowledge of the RDMA READ request itself.

9.7.5.2 AETH FORMAT

Acknowledge syndromes are carried in the AETH of the acknowledge
message. The table below illustrates the syndrome field of the AETH.

C9-119: When generating an AETH, a HCA responder implementing RC
service shall encode the AETH Syndrome Field as shown in Table 43
AETH Syndrome Field on page 277.

o9-71: If a TCA responder implements RC service, or if a CA responder
implements RD service, the responder shall encode the AETH Syndrome
Field as shown in Table 43 AETH Syndrome Field on page 277.

Table 43 AETH Syndrome Field

              bit 7   bits 6:5   bits 4:0          Definition
MSN valid     0       00         C CCCC            ACK (RC service only) (C CCCC = credit count)
              0       01         T TTTT            RNR NAK (T TTTT = timer value)
              0       10         X XXXX            reserved
              0       11         N NNNN            NAK (N NNNN = NAK code)
MSN invalid   1       00         0 0000            ACK (RD service only)
              1       00         0 0001 - 1 1111   reserved
              1       01         T TTTT            RNR NAK (T TTTT = timer value)
              1       10         X XXXX            reserved
              1       11         N NNNN            NAK (N NNNN = NAK code)



The msb of the syndrome indicates whether the MSN field of the AETH 
contains a valid MSN value or not. The details of the MSN field are de- 
scribed in Section 9.7.7.1. 

o9-72: If a CA implements Reliable Datagram service, bit 7 of the AETH 
Syndrome Field shall always be set to one by the responder CA and ig- 
nored by the requester CA. 

The interpretation of bits [4:0] depends on the code contained in bits [7:5]. 
Bits [4:0] may contain a positive acknowledgment with end-to-end flow 
control credits, an RNR NAK timer value, a positive acknowledgment 
without end-to-end credits, or a NAK code. 

C CCCC = encoded end-to-end flow control credits

T TTTT = RNR NAK Timer Field - see Table 45 Encoding for RNR NAK
Timer Field on page 283

N NNNN = NAK Code - see Table 44 NAK Codes on page 278

Code 100 0 0000 (MSN invalid, ACK) is defined for RD service only, since
that is the only time when MSN should be invalid.

Code 011 N NNNN (MSN valid, NAK) allows MSNs to be carried with NAK 
packets. 
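
To make the bit layout concrete, here is a minimal Python sketch that splits an 8-bit AETH syndrome into the fields of Table 43. The function name `decode_aeth_syndrome` and the dictionary it returns are assumptions for this example, not part of the specification; the meaning of the low five bits (credits, timer value, or NAK code) is left to the caller.

```python
def decode_aeth_syndrome(syndrome):
    """Split an 8-bit AETH syndrome into fields, following Table 43.

    Hypothetical helper: returns a dict with
      msn_valid - True when bit 7 is 0
      kind      - "ACK", "RNR NAK", "NAK" or "reserved"
      value     - bits 4:0 (credit count, timer value or NAK code)
    """
    if not 0 <= syndrome <= 0xFF:
        raise ValueError("syndrome must fit in 8 bits")
    msn_valid = (syndrome >> 7) == 0       # bit 7: 0 means MSN valid
    kind_bits = (syndrome >> 5) & 0b11     # bits 6:5
    value = syndrome & 0b11111             # bits 4:0
    if kind_bits == 0b00:
        if msn_valid or value == 0:
            kind = "ACK"                   # MSN-invalid ACK must be 0 0000
        else:
            kind = "reserved"              # 1 00 0 0001 - 1 1111
    elif kind_bits == 0b01:
        kind = "RNR NAK"
    elif kind_bits == 0b10:
        kind = "reserved"
    else:
        kind = "NAK"
    return {"msn_valid": msn_valid, "kind": kind, "value": value}
```

As a usage example, the code 100 0 0000 discussed above decodes to an ACK with `msn_valid` false and a value of zero.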


9.7.5.2.1 End-to-End Flow Control Credit Field

If bits [7:5] of the AETH Syndrome field are zero, then bits [4:0] of the
AETH Syndrome field carry encoded end-to-end flow control credits
from the responder to the requester. This field is only valid for reliable
connections. The encoding 5b11111 means that the credit field is not valid.
This encoding is also used for cases where the receive queue does not
support End-to-End credits. See Section 9.7.7.2 End-to-End (Message
Level) Flow Control on page 296 for further details.



9.7.5.2.2 NAK Codes 



If bits [6:5] of the AETH Syndrome field are b11, then bits [4:0] carry a NAK
code. The code guides the requester in selecting a recovery strategy. The
following sections describe all the possible NAK Codes. Even though an
RNR NAK has its own AETH syndrome (AETH[6:5] = b01), RNR NAK is
also described in this section.

The list of valid NAK codes is provided in Table 44. 

Table 44 NAK Codes

NAK Code (AETH bits 4:0)   Definition
0 0000                     PSN Sequence Error
0 0001                     Invalid Request
0 0010                     Remote Access Error
0 0011                     Remote Operational Error
0 0100                     Invalid RD Request
0 0101 - 1 1111            reserved

C9-120: If a requester receives an acknowledge message containing a
reserved code, it shall consider the acknowledge packet to be malformed
and shall silently drop it. This may eventually cause the requester to time
out while waiting for the missing acknowledge packet; at that time it will
either re-transmit the original request message, or stop operations on that
send queue.

9.7.5.2.3 PSN Sequence Error

A sequence error occurs when a responder detects a packet that is out of
PSN sequence, i.e. a PSN value that is neither equal to the expected PSN
nor within the valid duplicate packet range.

C9-121: The responder, when it constructs a NAK packet in response to a
sequence error, must insert its expected PSN value in the PSN field of the
BTH. This lets the requester back up its send queue to at least the point
of the failure and begin re-sending request packets from that point
forward.

A PSN sequence error may be retried by the requester a number of times.
Once the retry count has expired, the requester's transport notifies its
client that it did not succeed in transferring the message. The requester's
required behavior once its retry count has expired is given in 9.9.2
Requester Side Error Behavior on page 337. The following discussion
specifies the behavior before the retry count has expired.

When the responder detects a sequence error there is no impact on the
receive queue nor are any WQEs consumed. Instead, the receive queue
simply returns the NAK packet to the requester and resumes waiting for
an inbound request packet with the correct PSN value.

C9-122: Once a NAK packet for a sequence error has been returned to
the requester, the responder shall discard all subsequent requests that do
not contain the responder's expected PSN, except for valid duplicate
requests.

C9-123: If the responder receives a request packet with a PSN that is
logically less than its expected PSN (i.e. a valid duplicate request packet), it
shall respond to that request according to the rules for duplicate packet
processing.

9.7.5.2.4 Remote Access Error

A R_Key violation is caused by any or all of the following conditions for
either a RDMA READ, RDMA WRITE, or ATOMIC Operation:

• The R_Key field of the RETH is invalid.

• The virtual address and length or type of access specified is
outside the locally defined limits associated with the R_Key.

C9-124: For an HCA responder, when reporting an RDMA remote access
error, the BTH field of the acknowledge message must contain the PSN of
the request packet that caused the remote access error.

o9-73: If a TCA responder implements RDMA functionality, or if a CA
responder supports ATOMIC operations, then when reporting a remote
access error, the BTH field of the acknowledge message must contain the
PSN of the request packet that caused the remote access error.

The responder's behavior on detecting an access error, besides generating
a NAK-Remote Access Error packet, is specified in section 9.9.3
Responder Side Behavior on page 349.

The requester's behavior on receiving a NAK-Remote Access Error is
specified in section 9.9.2 Requester Side Error Behavior on page 337.

9.7.5.2.5 Invalid Request

The requester has requested an operation that is outside the established
usage of the transport service - generally, this is an OpCode that is not
supported by the responder or a request whose length exceeds the
available receive buffer space. For example, an RDMA request transmitted
to a responder that does not support RDMAs would cause an Invalid
Request Error. An out-of-sequence OpCode may also cause a NAK-Invalid
Request depending on the particular service.

C9-125: When reporting an invalid request, the BTH field of the
acknowledge packet must contain the responder's expected PSN value, i.e.,
the PSN of the request packet that contained the invalid request.

The responder's behavior upon detecting an invalid request, besides
generating a NAK-Invalid Request, is given in section 9.9.3 Responder Side
Behavior on page 349.

The requester's behavior on receiving a NAK-Invalid Request is given in
section 9.9.2 Requester Side Error Behavior on page 337.

9.7.5.2.6 Remote Operational Error

A remote operational error occurs when the responder encounters a
situation that prevents its receive queue from completing the current request.
The list of error conditions detectable by the responder, and reportable as
a remote operational error, is not specified since it is implementation
specific. Remote operational errors cannot be caused by anything the
requester may have done. Rather, they reflect a fault in the responder.

C9-126: When reporting a remote operational error, the BTH field of the
acknowledge message must contain the PSN of the request being
executed at the time the responder detected the operational error.

The responder's behavior upon detecting an operational error, besides
returning NAK-Remote Operational Error, is given in section 9.9.3
Responder Side Behavior on page 349.

The requester's behavior when it receives a NAK-Remote Operational
Error is specified in section 9.9.2 Requester Side Error Behavior on page
337.

9.7.5.2.7 Invalid RD Request

This NAK code is generated when the responder detects a Q_Key or RDD
violation while operating in RD service, or if the destination QP is not
configured for RD service, or if the destination QP is not in a state where it
can accept an inbound packet.

o9-74: If the responder's EE Context detects an invalid P_Key, the
request packet shall be silently dropped by the EE Context.

If no P_Key violation is detected, the EE Context forwards the packet to
the receive queue specified in the BTH.

C9-127: If the QP as specified in the BTH is not configured for RD service,
then a NAK-Invalid RD Request shall be returned.

The receive queue checks the Q_Key of the inbound request packet and
also checks that its current RDD value matches that of the EE Context.

If the responder's receive queue detects an invalid Q_Key, or if the
receive queue's RDD value does not match that of the EE Context, the
responder shall return a NAK-Invalid RD Request to the requester.

The responder's behavior upon detecting either a Q_Key or RDD
violation, besides generating a NAK-Invalid RD Request, is specified in section
9.9.3 Responder Side Behavior on page 349.

The requester's behavior in response to a NAK-Invalid RD Request is
specified in section 9.9.2 Requester Side Error Behavior on page 337.

9.7.5.2.8 RNR NAK

Under certain circumstances, a receive queue may be temporarily unable
to accept an inbound request message. For example, there may not
currently be a valid receive WQE posted to the receive queue. When this
occurs, the responder may return a response indicating Receiver Not Ready
(RNR NAK). On receiving a RNR NAK, the requester may, after waiting
for at least the interval specified in the RNR NAK, retry the same request.
"The same request" means the precise same request message beginning
with the same PSN as reported by the responder in its RNR NAK packet.

C9-128: For an HCA requester using Reliable Connection service, after
receiving an RNR NAK, the requester shall not substitute a different
request message by reusing the same PSN.

o9-75: If a TCA requester implements Reliable Connection service, after
receiving an RNR NAK, the requester shall not substitute a different
request message by reusing the same PSN.

For Reliable Datagram service, the requester may either exactly repeat
the request, or abandon the request and start a new message from a
different QP. If the latter is done in the middle of a message, the responder
may have to abort the receive WQE and start a new one.

C9-129: An HCA responder using Reliable Connection service, when
generating an RNR NAK, shall indicate the appropriate interval in the
timer field of the AETH. The value loaded in the timer field of the AETH
shall be as shown in Table 45 Encoding for RNR NAK Timer Field on page
283.

o9-76: If a CA responder implements Reliable Datagram service, or if a
TCA implements Reliable Connection service, it shall follow this rule:
when generating an RNR NAK, the responder shall indicate the
appropriate interval in the timer field of the AETH. The value loaded in the
timer field of the AETH shall be as shown in Table 45 Encoding for RNR NAK
Timer Field on page 283.

C9-130: An HCA requester using Reliable Connection service, after
receiving a RNR NAK, must wait for at least the interval specified in the timer
field of the AETH before retrying the request. If the requester fails to wait
for the appropriate timeout interval before re-trying the request, the
responder may silently drop the packet.

C9-131: An HCA requester using Reliable Connection service, after
receiving a RNR NAK, must wait for at least the interval specified in the timer
field of the AETH before retrying the request. If the requester fails to wait
for the appropriate timeout interval before re-trying the request, the
responder may silently drop the packet.



30 
31 



C9-132: An HCA requester using Reliable Connection service shall
maintain a 3 bit retry counter which is loaded during connection establishment
with information provided by the responder. This counter is used to limit
the number of times a requester can retry an operation which was RNR
NAK'ed. When a RNR NAK response is received, if the RNR NAK retry
counter is not equal to 7 (indicates infinite retry), the requester shall
decrement the RNR NAK retry counter. Thereafter, when the retry timer
expires, if the retry counter is non-zero, the requester may re-issue the
request.

o9-77: If a CA requester implements Reliable Datagram service, or if a
TCA requester implements Reliable Connection Service, it shall maintain
a 3 bit retry counter which is loaded during connection establishment with
information provided by the responder. This counter is used to limit the
number of times a requester can retry an operation which was RNR
NAK'ed. When a RNR NAK response is received, if the RNR NAK retry
counter is not equal to 7 (indicates infinite retry), the requester shall
decrement the RNR NAK retry counter. Thereafter, when the retry timer
expires, if the retry counter is non-zero, the requester may re-issue the
request.

A locally detected error is recorded by the requester if the retry counter
has decremented to zero at the time that the RNR NAK retry timer expires.
See Section 9.9.2.1 Requester Side Error Detection - Locally Detected
Errors on page 337 for further details.
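
The retry-counter behavior described above can be sketched in Python as follows. This is a literal reading of the text, not a reference implementation: the class name `RnrRetryCounter`, the method names, and the string results are assumptions, and holding the counter at zero rather than letting it underflow is a choice made for this example.

```python
INFINITE_RETRY = 7   # a counter value of 7 indicates infinite retry


class RnrRetryCounter:
    """3-bit RNR NAK retry counter kept by a requester (illustrative only)."""

    def __init__(self, initial):
        if not 0 <= initial <= 7:
            raise ValueError("retry counter is a 3-bit value")
        self.count = initial   # loaded during connection establishment

    def on_rnr_nak(self):
        """Called when an RNR NAK response is received."""
        if self.count != INFINITE_RETRY and self.count > 0:
            self.count -= 1    # assumption: never decrement below zero

    def on_timer_expired(self):
        """Called when the RNR NAK retry timer expires.

        Returns "retry" if the request may be re-issued, or "error" if a
        locally detected error should be recorded.
        """
        return "retry" if self.count != 0 else "error"
```

With an initial value of 2, the first RNR NAK still permits a retry once the timer expires, while the second one leaves the counter at zero and reports an error.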

The timer field is encoded as shown in the Table below. 

Table 45 Encoding for RNR NAK Timer Field

RNR Time   Delay in milliseconds   RNR Time   Delay in milliseconds
00000      655.36                  10000      2.56
00001      0.01                    10001      3.84
00010      0.02                    10010      5.12
00011      0.03                    10011      7.68
00100      0.04                    10100      10.24
00101      0.06                    10101      15.36
00110      0.08                    10110      20.48
00111      0.12                    10111      30.72
01000      0.16                    11000      40.96
01001      0.24                    11001      61.44
01010      0.32                    11010      81.92
01011      0.48                    11011      122.88
01100      0.64                    11100      163.84
01101      0.96                    11101      245.76
01110      1.28                    11110      327.68
01111      1.92                    11111      491.52
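
As an illustration, the encoding in Table 45 can be held in a lookup list
indexed by the 5 bit timer field. The function name is hypothetical; the
delay values are copied from the table.

```python
# Delay in milliseconds, indexed by the 5 bit RNR NAK timer field (Table 45).
RNR_NAK_DELAY_MS = [
    655.36, 0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.12,
    0.16, 0.24, 0.32, 0.48, 0.64, 0.96, 1.28, 1.92,
    2.56, 3.84, 5.12, 7.68, 10.24, 15.36, 20.48, 30.72,
    40.96, 61.44, 81.92, 122.88, 163.84, 245.76, 327.68, 491.52,
]


def rnr_nak_delay_ms(timer_field: int) -> float:
    """Return the delay for a 5 bit timer field value (hypothetical helper)."""
    if not 0 <= timer_field <= 0b11111:
        raise ValueError("timer field is a 5 bit value")
    return RNR_NAK_DELAY_MS[timer_field]
```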



The use of RNR NAK for temporary problems that do not affect the whole 
message (such as a memory page not present) is not prohibited. In par- 
ticular, for Reliable Datagram service, an RNR NAK returned in the middle 
of a SEND request message by a responder may result in the current 
message being abandoned by the requester and a new message being 
sent from another queue pair. This may result in unexpected incomplete 
messages at the responder. These incomplete messages are detected by 
the responder as an OpCode sequence error, thus allowing the responder 
to complete the partially completed WQE in error and begin receiving the 
new request. 

A responder should use this feature as a mechanism to delay the
incoming request when a local resource is unavailable only rarely. The RNR
NAK mechanism consumes bandwidth in that an incoming packet will be
aborted and will have to be re-sent.

9.7.6 Requester: Receiving Responses

9.7.6.1 Validating Inbound Response Packets

On receipt of an inbound acknowledge packet, a requester validates the
packet as follows (reference the following figure):

[Figure 95 Valid and Invalid Ack PSN Regions. The figure shows the range
of Packet Sequence Numbers, 0 to 16,777,215 (2^24 values), divided into a
Ghost Ack Region and an Outstanding Ack Region. Labels: "Expected
Response: range < 8,388,608 PSNs"; "PSN of oldest unacknowledged
request"; "PSN of most recently acknowledged request (used for
unsolicited Ack credit recovery)"; "Requester's Next PSN"; "Requester's
most recently sent PSN".]

C9-133: To verify the integrity of the packet, the requester shall validate
the packet as specified in Section 9.6 Packet Transport Header Validation
on page 228. Invalid packets shall be silently dropped by the requester.

C9-134: For an HCA requester using Reliable Connection service, the
PSN shall be examined to detect out of order packets. Since
acknowledges may be coalesced as described in section 9.7.5.1.2 Coalesced
Acknowledge Messages on page 263, the PSN is used to detect coalesced
responses.

o9-78: If a TCA requester implements Reliable Connection service, or if a
CA requester implements Reliable Datagram service, the PSN of each
acknowledge packet shall be examined to detect out of order packets. Since
acknowledges may be coalesced as described in section 9.7.5.1.2
Coalesced Acknowledge Messages on page 263, the PSN is used to detect
coalesced responses.

C9-135: For an HCA requester using Reliable Connection service, the
validity of the acknowledge syndrome shall be checked according to the
table in Section 9.7.5.2.2 NAK Codes on page 278. A response packet
containing a reserved NAK code shall be simply dropped.

o9-79: If a TCA requester implements Reliable Connection service, or if a
CA requester implements Reliable Datagram service, the validity of the
acknowledge syndrome shall be checked according to the table in Section
9.7.5.2.2 NAK Codes on page 278. A response packet containing a
reserved NAK code shall be simply dropped.

If the packet is determined to be valid, it is processed by the requester.
While processing the acknowledge packet, the requester may encounter
local errors. The list of local errors that the requester may encounter when
processing the acknowledge message is not specified since it is
implementation specific, but includes any error due to a fault on the requester
side. The required behaviors for this case are specified in 9.9.2 Requester
Side Error Behavior on page 337.

As is the case with request packets, each response packet carries a PSN.
The requester, on receiving a response packet, checks the PSN to
determine if the response is an expected response or a ghost acknowledge
packet. Conceptually, the requester keeps track of the PSN of the oldest
unacknowledged request packet and the PSN of the most recently sent
request. These two PSNs define the endpoints of a range of PSNs. If the
PSN of a response packet falls within that range then the packet is an
expected response packet. If the response does not fall within that region,
then the response is considered a ghost acknowledge packet and is
dropped by the requester.

C9-136: For an HCA requester using Reliable Connection service, ghost
acknowledge packets shall be dropped by the requester.

o9-80: If a TCA requester implements Reliable Connection service, or if a
CA requester implements Reliable Datagram service, ghost acknowledge
packets shall be dropped by the requester.
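
The expected-response versus ghost-acknowledge check can be sketched as
follows. This is a minimal sketch, assuming the range test is done with
arithmetic modulo 2^24 so that it also works when the PSN wraps; the
function and parameter names are hypothetical.

```python
PSN_MODULUS = 1 << 24  # PSNs range from 0 to 16,777,215


def is_expected_response(psn: int, oldest_unacked_psn: int,
                         most_recent_sent_psn: int) -> bool:
    """True if psn lies between the oldest unacknowledged request and the
    most recently sent request (inclusive); otherwise it is a ghost ack."""
    # Width of the outstanding range, measured forward from the oldest PSN.
    span = (most_recent_sent_psn - oldest_unacked_psn) % PSN_MODULUS
    # Forward distance of the response PSN from the oldest PSN.
    offset = (psn - oldest_unacked_psn) % PSN_MODULUS
    return offset <= span
```

A response for which this returns False would be treated as a ghost
acknowledge packet and dropped.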

9.7.6.1.1 Requester Response to a NAK Message

The requester's reaction to a negative response message depends on the
NAK code that is returned, and whether the queue pair is configured for
reliable connected or reliable datagram service.

A NAK-Sequence error triggers an automatic retry of the request. The
PSN in the response packet is the requester's indication of the request
packet that the responder believes it missed, thus, the requester can retry
that request. To prevent the requester from retrying the same request
forever, the requester maintains a 3 bit retry counter which is used to count
the number of times a particular request packet has been retried and
timed out. See Section 9.9.2.1 Requester Side Error Detection - Locally
Detected Errors on page 337 for a full description of the retry counter.

For other NAK packets, the response of the send queue depends on
whether the queue is providing Reliable Datagram or Reliable Connected
service.

C9-137: For an HCA requester using Reliable Connection service, the
requester shall decrement its 3 bit retry counter each time the responder
returns a NAK-Sequence error for a given request packet. The counter shall
be re-loaded whenever the given outstanding request is cleared. If
automatic path migration is not supported, and if a NAK-Sequence error is
returned once more, then the requester shall declare a locally detected
error.

o9-81: If a TCA requester implements Reliable Connection service, the
requester shall decrement its 3 bit retry counter each time the responder
returns a NAK-Sequence error for a given request packet. The counter
shall be re-loaded whenever the given outstanding request is cleared. If
automatic path migration is not supported, and if a NAK-Sequence error
is returned once more, then the requester shall declare a locally detected
error.

o9-82: If an HCA implements automatic path migration, or if a TCA
implements both automatic path migration and Reliable Connection service,
then the following is required. If a NAK-Sequence error is returned after
the retry counter has decremented to zero, then the channel adapter shall
attempt an automatic path migration. Following the automatic path
migration, the requester shall reload the retry counter and begin the process
over again. If the requester still does not succeed in sending the request
after several retries, then the requester shall declare a locally detected
error.

Reliable Datagram Behavior: Reliable datagrams require the use of an
EE Context that maintains the packet sequence numbers and thus
ensures reliable delivery of requests. The rules for responding to a NAK
ensure that the current PSN at the requester and the expected PSN at the
responder remain in sync. Therefore, the connection between the
requester's EE Context and responder's EE Context survives. This allows
the connection to continue to service other Send/Receive QPs.

Depending on the cause and the operation in question, the EE context
may undertake any of the following options after detecting a failed request
packet:

a) It may retry the same failed packet from the same QP, or

b) It may de-schedule the QP upon which the error was detected
   and schedule the next available QP.

If the "same failed packet" is to be retried, the requester is not required to
begin its retransmission sequence beginning with the PSN indicated in the
responder's NAK; instead, it may begin its retransmission with an earlier
request packet. These earlier request packets are treated by the
responder as normal duplicate packets causing no ill side effects.

C9-138: For an HCA requester using Reliable Connection service, the
requester must receive and discard any duplicate acknowledge messages
with no ill side effects.

See section 9.9 Error detection and handling on page 336 for a complete
description of the errors and the EE Context's subsequent behavior.

o9-83: If a TCA requester implements Reliable Connection service, or if a
CA requester implements Reliable Datagram service, the requester must
receive and discard any duplicate acknowledge messages with no ill side
effects.

Reliable Connected Behavior: For reliable connections, the requester
has only two possible alternatives when it receives a NAK. It may either
retry the same request packet, or it may mark the current WQE as
completed in error and notify its client. Note that not all NAKs can be retried.
If the requester retries the same request packet, it is not required to begin
its retransmission sequence beginning with the PSN indicated in the
responder's NAK; instead, it may begin its retransmission with an earlier
request packet. These earlier request packets are treated by the responder
as normal duplicate packets causing no ill side effects.


9.7.6.1.2 Detecting Lost Acknowledge Messages and Timeouts

Under some error conditions the requester does not receive an
acknowledge message in response to one or more of its requests. This can
occur for one of three reasons:

1) The responder generated an acknowledge message that was
   subsequently lost in the fabric, or,

2) The responder failed for some reason preventing it from generating
   an acknowledge message, or

3) The original request message was lost in the fabric before it was
   received by the responder.

All three of these conditions are detected by the requester as a lost
acknowledge message.

Often, these errors are corrected automatically due to acknowledge
coalescing; the next acknowledge received by the requester serves to
implicitly acknowledge all outstanding requests.

However, there are several cases where a lost acknowledge message is
not automatically recovered by the coalesced acknowledge rules. For
example, a NAK message lost in the fabric will not be resolved via
acknowledge coalescing because the responder side rules require that the
responder may have no more than one NAK message outstanding at a
given time.

C9-139: For an HCA requester using Reliable Connection service, to
detect missing responses, every Send queue is required to implement a
Transport Timer to time outstanding requests.

o9-85: If a CA requester implements Reliable Datagram service, to detect
missing responses, every EE Context is required to implement a
Transport Timer to time outstanding requests.

Because of variabilities in the fabric, scheduling algorithms and
architecture of the channel adapters and many other factors, it is not possible, nor
desirable, to time outstanding requests with a high degree of precision.
Nonetheless, the Transport Timer is an integral element of the ACK/NAK
protocol by providing a deterministic means to detect lost requests or
responses.

The requester need not separately time each request launched into the
fabric, but instead simply begins the timer whenever it is expecting a
response. Once started, the timer is restarted each time an acknowledge
packet is received as long as there are outstanding expected responses.
The timer does not detect the loss of a particular expected acknowledge
packet, but rather simply detects the persistent absence of response
packets.

The timer measures the lesser of:

• the time since the requester sent a packet with the AckReq bit set in
  the BTH,

• or the time since the last valid acknowledge packet arrived.

The operation is as follows.

The requester starts the timer running whenever the timer is not currently
running AND:

1) The requester sets the AckReq bit in a Send or RDMA WRITE
   request or,

2) The requester generates an RDMA READ request or,

3) The requester generates an ATOMIC Operation request.

Thereafter, the requester restarts the timer each time it receives a new
inbound acknowledge packet as long as there are still outstanding expected
responses.

The timer is stopped whenever there are no outstanding expected
responses.

An "expected response" is created by the requester by setting the AckReq
bit in a request packet or by generating an RDMA READ request or an
ATOMIC Operation request. An outstanding expected response is a
response to any request packet which has the AckReq bit set in the BTH, or
any RDMA READ request or ATOMIC Operation request, which has not
been acknowledged.
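
The start, restart and stop rules for the Transport Timer can be sketched
as below. This is an illustration only: the class, its method names and the
"starts" bookkeeping field are hypothetical, and the actual measurement of
elapsed time is left out.

```python
class TransportTimer:
    """Hypothetical model of when the Transport Timer runs."""

    def __init__(self) -> None:
        self.running = False
        self.outstanding = 0  # outstanding expected responses
        self.starts = 0       # how many times the timer was (re)started

    def _start(self) -> None:
        self.running = True
        self.starts += 1

    def on_request_sent(self, ack_req: bool = False,
                        rdma_read: bool = False,
                        atomic: bool = False) -> None:
        # Only these requests create an expected response.
        if ack_req or rdma_read or atomic:
            self.outstanding += 1
            if not self.running:
                self._start()

    def on_ack_received(self, responses_cleared: int) -> None:
        self.outstanding = max(0, self.outstanding - responses_cleared)
        if self.outstanding > 0:
            self._start()         # restart while responses are still expected
        else:
            self.running = False  # nothing outstanding: stop the timer
```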

17 

As specified in the Software Transport Interface Chapter each QP has a
single Local ACK Timeout value associated with it which is used to derive
the Transport Timer timeout interval Ttr.

C9-140: For an HCA requester using Reliable Connection service, the
Transport Timer timeout interval, Ttr shall be defined to be 4.096 uS *
2^(Local ACK Timeout). Local ACK Timeout shall be a 5 bit value, with zero
meaning that the timer is disabled. The minimum acceptable value of
Local ACK Timeout, other than zero, shall be defined by the CA vendor. If
a non-zero Local ACK Timeout value is loaded in QP context which is less
than the minimum supported by the CA, then the CA may use its minimum
value.

Thus, Ttr <= To <= 4Ttr.
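
As a worked example of the timeout interval, Ttr = 4.096 uS * 2^(Local ACK
Timeout), the sketch below computes Ttr in microseconds. The function name
is hypothetical, and returning None for a disabled timer is a choice made
for this illustration.

```python
from typing import Optional


def transport_timeout_us(local_ack_timeout: int) -> Optional[float]:
    """Return Ttr in microseconds, or None when the timer is disabled."""
    if not 0 <= local_ack_timeout <= 31:
        raise ValueError("Local ACK Timeout is a 5 bit value")
    if local_ack_timeout == 0:
        return None  # zero means the timer is disabled
    return 4.096 * (2 ** local_ack_timeout)
```

For example, a Local ACK Timeout of 14 gives 4.096 uS * 16384, about
67.1 milliseconds.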

34 

Once a timeout for a given request packet is detected, the requester may
retry the request.

C9-142: For an HCA requester using Reliable Connection service, to
prevent the requester from retrying the request forever, the requester shall
maintain a 3 bit retry counter which is used to count the number of times
a particular request packet has been retried and timed out. This counter
shall be decremented each time the transport timer expires for a given
request packet. The counter shall be re-loaded whenever a given
outstanding request is cleared.

See Section 9.9.2.1 Requester Side Error Detection - Locally Detected
Errors on page 337 for a full description of the retry counter.

C9-143: For an HCA requester using Reliable Connection service, if
automatic path migration is not supported, and if the transport timer expires
after the retry counter has decremented to zero, then the requester shall
declare a locally detected error.

o9-86: If automatic path migration is supported, and if the transport timer
expires after the retry counter has decremented to zero, then the channel
adapter shall attempt an automatic path migration. Following the
automatic path migration, the requester shall reload the transport timer retry
counter and begin the process over again. If the requester still does not
succeed in sending the request after several retries, then the requester
shall declare a locally detected error.

o9-87: If a TCA requester implements Reliable Connection service, then
the five preceding HCA compliance statements (that is, timeout rules for
outstanding requests) shall be applicable to that TCA.

o9-88: If a CA requester implements Reliable Datagram service, then the
five preceding HCA compliance statements (that is, timeout rules for
outstanding requests) shall be applicable to that CA. In that case, the
functionality described applies to the EE Context rather than the Queue Pair.
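
The timeout retry rules, including the optional automatic path migration,
can be sketched as a small policy object. This is a sketch under stated
assumptions: the names are hypothetical, and allowing only a single
migration attempt before declaring an error is an assumption, since the
text says only "after several retries".

```python
class TimeoutRetryPolicy:
    """Hypothetical model of the 3 bit transport timer retry counter."""

    def __init__(self, retry_count: int, apm_supported: bool) -> None:
        if not 0 <= retry_count <= 7:
            raise ValueError("retry counter is a 3 bit value")
        self.initial = retry_count
        self.count = retry_count
        self.apm_supported = apm_supported
        self.migrated = False  # assumption: at most one migration attempt

    def on_request_cleared(self) -> None:
        # The counter is re-loaded when the outstanding request is cleared.
        self.count = self.initial

    def on_transport_timer_expired(self) -> str:
        if self.count > 0:
            self.count -= 1
            return "retry"
        # Timer expired after the counter reached zero.
        if self.apm_supported and not self.migrated:
            self.migrated = True
            self.count = self.initial  # reload and begin the process again
            return "migrate_path"
        return "local_error"
```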

9.7.6.1.3 Duplicate Acknowledgements

9.7.1 Packet Sequence Numbers (PSN) on page 240 describes how
duplicate requests are generated. These requests may result in duplicate
acknowledgments being returned by the responder. The responder may
also send unsolicited Acks that appear to be "Ghost Acks" from the point
of view of the requester.

C9-144: For an HCA requester using Reliable Connection service, if the
responder is configured to generate end-to-end flow control credits, then
the requester must extract end-to-end flow control credits from a duplicate
acknowledgment.

o9-89: If a TCA implements Reliable Connection service, and if the
responder is configured to generate end-to-end flow control credits, then the
requester must extract end-to-end flow control credits from a duplicate
acknowledgment.

C9-145: For an HCA requester using Reliable Connection service,
duplicate acknowledgments shall be discarded.



o9-90: If a TCA requester implements Reliable Connection service, dupli- 
cate acknowledgments shall be discarded. 

See Section 9.7.7.2 End-to-End (Message Level) Flow Control on page 
296 for a complete description. 



9.7.7 Reliable Connections 



A reliable connection is a connection created between a single local QP 
and a single remote QP and that can guarantee that messages are deliv- 
ered at most once, in order and without corruption (in the absence of un- 
recoverable errors) between the local and remote QPs. 

Table 46 Reliable Connected Service Characteristics

Property / Level of Reliability                  Support
Corrupt data detected                            Yes
Data delivered exactly once (Except for an       Yes
  unrecoverable error - that is reported to
  the application)
Data order guaranteed                            Yes
Data loss detected                               Yes
RDMA Support                                     Yes - both Read and Write
State of Send/RDMA WRITE when request            Completion on remote end node
  completed
ATOMIC Support                                   Optional
Multi-packet message support                     Yes
Number of messages in flight per QP              2^23 (maximum)
Number of packets allowed in flight per QP       2^23 (maximum)
Number of messages enqueued per QP               Implementation limited only
Maximum Message Size                             2^31 Bytes



The desired reliability characteristics are provided by application of packet 
sequence numbers and the ACK/NAK protocol. 

C9-146: For an HCA, each QP configured for Reliable Connection service
must conform to the requirements specified in section 9.7 Reliable
Service on page 238, the characteristics given in Table 46 Reliable
Connected Service Characteristics on page 291, and any additional
requirements given in this section.

o9-91: If a TCA implements Reliable Connection service, each QP
configured for Reliable Connection service must conform to the requirements
specified in section 9.7 Reliable Service on page 238, the characteristics
given in Table 46 Reliable Connected Service Characteristics on page
291, and any additional requirements given in this section.

9.7.7.1 Generating MSN Value

For Reliable Connected service, the Message Sequence Number is a
number returned by the responder to the requester indicating the number
of messages completed by the responder. The MSN is carried in the three
least significant bytes of the AETH. The MSN is provided to the requester
as a service to assist it in completing WQEs by informing the requester of
the messages that have been completed by the responder.

C9-147: An HCA responder using Reliable Connection service shall
return an MSN in the AETH of every response packet.

o9-92: If a TCA responder implements Reliable Connection service, it
shall return an MSN in the AETH of every response packet.

Logically, the requester associates a sequential Send Sequence Number
(SSN) with each WQE posted to the send queue. The SSN bears a
one-to-one relationship to the MSN returned by the responder in each
response packet. Therefore, when the requester receives a response, it
interprets the MSN as representing the SSN of the most recent request
completed by the responder to determine which send WQE(s) can be
completed.

Note that SSN as described above is a logical concept only which is given
to convey the concept of how the MSN is applied; an implementation is
not required to implement it as described.

Following initialization, the first WQE posted to the Send queue has an
SSN of one assigned to it. The responder initializes its MSN counter to
zero. Thereafter, the responder increments its 24-bit MSN value
whenever it completes execution of an inbound request message. This is
illustrated in Figure 96 below.

Figure 96 Responder Initializes MSN to Zero

[Figure: packet exchange between Requester and Responder.
Requests: SSN=1, PSN=15; SSN=1, PSN=16.
Responses: PSN=15, MSN=0; PSN=16, MSN=1.
Requester's Send Queue: request 1 (SSN 00 00 01), request 2 (SSN 00 00 02),
request 3 (SSN 00 00 03).
Notes: Request 1 is a multi-packet send. Responder returns an acknowledge
for each packet received. Response a16 marks the completion of request 1.
Legend: 'r' is a request packet; 'a' is an acknowledge packet; MSN is shown
in parentheses.]

C9-148: An HCA responder using Reliable Connection service shall ini- 
tialize its MSN value to zero. The responder shall increment its MSN 
whenever it: 

1) Completes a receive WQE, or

2) Successfully retires an RDMA WRITE request without immediate 
data, or 

3) Successfully retires an RDMA READ request by returning all packets 
of the Read Response message, or 

4) Successfully retires an ATOMIC Operation request by returning the 
ATOMIC response message. 

o9-93: If a TCA responder implements Reliable Connection service, it 
shall calculate and update MSN as described in the preceding compliance 
statement. 
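
The responder's MSN handling can be sketched as a counter that starts at
zero and is incremented once per completed message. The class name is
hypothetical, and wrapping the value modulo 2^24 is an assumption based on
the MSN being a 24-bit value.

```python
MSN_MODULUS = 1 << 24  # the MSN is a 24-bit value


class ResponderMsn:
    """Hypothetical MSN counter kept by a responder."""

    def __init__(self) -> None:
        self.value = 0  # initialized to zero

    def on_message_completed(self) -> int:
        # Called when a receive WQE completes, or when an RDMA WRITE without
        # immediate data, an RDMA READ or an ATOMIC request is retired.
        self.value = (self.value + 1) % MSN_MODULUS
        return self.value
```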

9.7.7.1.1 Requester Behavior On Receiving a New MSN

As described above, the existence of a new MSN value in a response
packet may be used by the requester as a signal to complete certain
WQEs posted to its send queue. Since the responder may choose to
coalesce acknowledges, a single response packet may in fact acknowledge
several request messages. Thus, when it receives a new MSN, the
requester begins evaluating WQEs on its send queue beginning with the
oldest outstanding WQE and progressing forward. This is illustrated in
the figure below for the case where there are no outstanding RDMA READ
requests or ATOMIC Operation requests on the send queue.



Figure 97 Requester Behavior - Completing WQEs

[Figure: packet exchange between Requester and Responder.
Requests: SSN=1, PSN=15; SSN=2, PSN=16; SSN=3, PSN=17.
Response: response to r17, MSN=3.
Requester's Send Queue (completed WQEs): request 1 (SSN 00 00 01),
request 2 (SSN 00 00 02), request 3 (SSN 00 00 03).
Note: Requester completes WQEs for request 1, 2 and 3 inclusive.
Legend: 'r' is a request packet; 'a' is an acknowledge packet; MSN is shown
in parentheses.]



For the case where there are outstanding RDMA READ requests or
ATOMIC Operation requests, the situation is slightly more complex. In this
case, the requester only completes outstanding WQEs up to either the
first outstanding RDMA READ request, ATOMIC Operation request, or
WQE whose SSN matches the MSN in the response packet, whichever
comes first. This is because both RDMA READ requests and ATOMIC
Operation requests require an explicit response and thus cannot be
completed until such an explicit response is received.

Figure 98 Limitation on Completing Send Queue WQEs

[Figure: packet exchange between Requester and Responder.
Requests: PSN=1, SSN=1; PSN=3, SSN=2; RDMA READ request: PSN=4, SSN=3
(request is for 3 response packets); PSN=7, SSN=4.
Responses: RDMA READ Response; response to request r5.
Requester's Send Queue: SEND1 (SSN 00 00 01), SEND2 (SSN 00 00 02),
READ3 (SSN 00 00 03), SEND4 (SSN 00 00 04).
Notes: Since RDMA READs are loosely ordered, it is likely that the
responder will "complete" SEND4 before it finishes returning the READ
response data (a3, a4, a5). Nonetheless, a3 has an MSN of 4 indicating that
the responder has completed SEND1, SEND2 and SEND4. However, the
requester may only complete SEND1 and SEND2 because of the presence of
READ3 in the send queue.
Legend: 'r' is a request packet; 'a' is an acknowledge packet (message);
MSN is shown in parentheses.]
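
The completion rule for send queue WQEs can be sketched as a scan from the
oldest outstanding WQE. This is a minimal sketch with hypothetical names;
it ignores wrap-around of the 24-bit MSN for simplicity, which a real
implementation would need to handle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SendWqe:
    ssn: int
    # True for RDMA READ and ATOMIC requests, which need an explicit response.
    needs_explicit_response: bool = False


def wqes_completed_by_msn(send_queue: List[SendWqe], msn: int) -> List[int]:
    """Return the SSNs that a newly received MSN allows to be completed.

    send_queue is ordered oldest first. The scan stops at the first RDMA
    READ or ATOMIC request, or after the WQE whose SSN matches the MSN.
    """
    completed = []
    for wqe in send_queue:
        if wqe.needs_explicit_response or wqe.ssn > msn:
            break
        completed.append(wqe.ssn)
    return completed
```

With the send queue of Figure 98 (SEND1, SEND2, READ3, SEND4) and an MSN of
4, only SSNs 1 and 2 are returned.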

C9-149: For an HCA responder using Reliable Connection service, the 
MSN counter shall be inserted in the AETH regardless of whether the re- 
sponse is a positive acknowledgment, a negative acknowledgment or a 
duplicate acknowledgment. 

o9-94: If a TCA responder implements Reliable Connection service, the
MSN counter shall be inserted in the AETH regardless of whether the
response is a positive acknowledgment, a negative acknowledgment or a
duplicate acknowledgment.

9.7.7.2 End-to-End (Message Level) Flow Control

IBA provides an end-to-end (or message level) flow control capability for
reliable connections that can be used by a responder to optimize the use
of its receive resources. Essentially, a requester cannot send a request
message unless it has appropriate credits to do so.

Encoded credits are transported from the responder to the requester in an
acknowledge message in the Syndrome field of the AETH. The credits
carried in the AETH are with respect to the MSN field of the same AETH;
therefore proper interpretation of the credit field also requires
interpretation of the MSN field. See Section 9.7.5.2 AETH Format on page 277 for
a full description of the appropriate AETH fields.

Each credit represents the receive resources needed to receive one
inbound request message. Specifically, each credit represents one WQE
posted to the receive queue. The presence of a receive credit does not,
however, necessarily mean that enough physical memory has been
allocated. For example, it is still possible, even if sufficient credits are
available, to encounter a condition where there is insufficient memory available
to receive the entire inbound message.

1) The end-to-end credit mechanism applies only to Reliable Connected
   service.

2) End-to-End credits are generated by a responder's receive queue
   and consumed by a requester's send queue.

3) Requirements on a CA for supporting end-to-end flow control are
   given in Chapter 17: Channel Adapters on page 790. HCA receive
   queues must generate end-to-end credits, but TCA receive queues
   are not required to do so. If the TCA's receive queue generates
   End-to-End credits, then the corresponding send queue must receive and
   respond to those credits.

4) Credits are issued on a per message basis, without regard to the size
   of the message.

5) End-to-End credits are carried in the AETH as an encoded 5-bit field.

6) The responder may send credits to the requester asynchronously by
   using an Unsolicited acknowledge packet. An unsolicited
   acknowledge packet is created by re-sending the most recently sent
   acknowledge packet.

C9-150: Each HCA receive queue shall generate end-to-end flow control
credits.




o9-95: Each TCA receive queue may generate end-to-end credits.

C9-151: If a TCA's given receive queue generates End-to-End credits,
then the corresponding send queue shall receive and respond to those
credits. This is a requirement on each send queue of a CA.

9.7.7.2.1 Transferring Credits from Responder to Requester

Two mechanisms are defined for transporting credits from the responder's
receive queue to the requester's send queue. The credits can be
piggybacked onto an existing acknowledge message, or a special unsolicited
acknowledge message can be generated by the responder. Piggybacked
credits are those credits that are carried in the AETH field of an already
scheduled acknowledge packet.

Piggybacked Credits:

Piggybacking of end-to-end credits refers to transferring credits to the
requester in the AETH of a normal acknowledge packet. Credits are carried
in AETH Syndrome[4:0]. Credits can be piggybacked onto any
acknowledge packet when the MSN field in the AETH is also valid.

Unsolicited Acknowledge Packet:

An unsolicited acknowledge message appears to the requester like a
duplicate of the most recent positive acknowledge message. Since the
ACK/NAK protocol prohibits the responder from sending duplicate
negative acknowledge packets (NAKs), an unsolicited acknowledge cannot be
created by re-sending a NAK packet.

An unsolicited acknowledge may be sent by the responder at any time.
The requester's send queue simply recovers the credit field and the MSN
from the most recently received acknowledge packet.

Since an unsolicited acknowledge packet appears to the requester as a
duplicate response, it has no effect on the requester other than the
transfer of the credits.

C9-152: The unsolicited acknowledge packet must have a valid MSN
field.

9.7.7.2.2 Negotiating Connections: Initial Credits

For each connection established, the use (or not) of end-to-end flow
control is established separately for each direction. The capabilities of the
receive queue determine the flow control characteristics for that half of the
connection.

C9-153: If the receive queue signals that it is expecting to generate
credits, then the corresponding send queue must observe the end-to-end
flow control rules. If, on the other hand, the receive queue signals that it
will not generate end-to-end flow control credits, then the corresponding
send queue may transmit request messages at will without regard for
credits. This is a requirement on each send queue of a CA.

C9-154: If a TCA's receive queue does not generate End-to-End credits,
it shall place the value 5b11111 in AETH Syndrome[4:0] signalling that the
credit field is invalid.
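
Reading the credit field on the requester side can be sketched as below.
This is an illustration with hypothetical names: it only isolates AETH
Syndrome[4:0] and recognizes the "invalid" marker; it does not decode the
encoded credit value and does not inspect the remaining syndrome bits.

```python
from typing import Optional

CREDIT_FIELD_MASK = 0b11111
CREDIT_FIELD_INVALID = 0b11111  # signals that the credit field is invalid


def credit_field(aeth_syndrome: int) -> Optional[int]:
    """Return the encoded 5 bit credit field, or None if marked invalid."""
    field = aeth_syndrome & CREDIT_FIELD_MASK
    return None if field == CREDIT_FIELD_INVALID else field
```

A None result would tell the send queue that the receive queue is not
generating end-to-end credits.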



10 
11 



C9-155: When the receive queue is in the RESET state, the transport 
shall set the initial credit count to zero. Once the queue pair has transi- 
tioned to the INITIALIZED, RTR, SQD or RTS states, it shall increment its 
credit count for each receive WQE posted. 1 3 

14 

Once it is in the RTR, SQD or RTS states, the responder may transfer 1 5 
these credits to the requester by using unsolicited acknowledges. 



17 
18 



Normally an unsolicited acknowledge is created by re-sending the most
recently sent positive acknowledge packet with an updated credit field. At
initialization time, however, no acknowledge packets have yet been sent,
so the normal method for creating an unsolicited acknowledge cannot be
used. Therefore, at initialization time, an unsolicited acknowledge is
created using a PSN obtained by subtracting "1" from the initial PSN. Thus,
if the PSN is initialized to 0x000000 when the receive queue is in RESET
state, then the PSN of the initial unsolicited acknowledge shall be
0xFFFFFF. "Initialization time" in this context means the interval that
begins when the receive queue transitions out of the RESET state and lasts
until it has sent an acknowledge packet in any of the RTR, SQD or RTS
states.
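The PSN rule above amounts to a decrement modulo 2^24. A minimal sketch, assuming 24-bit PSNs (as the 0x000000 to 0xFFFFFF example implies); the function name is hypothetical and not part of the specification:

```python
PSN_MASK = 0xFFFFFF  # assumption: PSNs are 24-bit values, per the 0xFFFFFF example


def initial_unsolicited_ack_psn(initial_psn: int) -> int:
    """Return the PSN for the first unsolicited acknowledge.

    The value is the initial PSN minus one, wrapped to 24 bits.
    """
    return (initial_psn - 1) & PSN_MASK
```

With an initial PSN of 0x000000 the function returns 0xFFFFFF, matching the example in the text.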

To the send queue which receives this initial unsolicited acknowledge
packet, it will appear as a "ghost" acknowledge packet (see Figure 95 Valid
and Invalid Ack PSN Regions on page 284). The requester's send queue may
accept the MSN and credits contained in the unsolicited acknowledge
packet but ignore the rest of the packet. This is an exception to the normal
rules for ghost responses, which require that ghost acknowledge packets
be dropped.

The above paragraph notwithstanding, responsibility for recovering initial
credits from the responder shall lie with the requester; if the responder
provides initial credits by using an unsolicited acknowledge, the requester
may accept those as its initial credits in satisfaction of its responsibility to
recover initial credits.

C9-156: If the responder does not provide initial credits, the requester
shall behave as specified in Section 9.7.7.2.5 Requester Behavior - Limited
Send WQEs on page 304.



Figure 99 Requester End-to-End Credit Processes on page 299 and
Figure 100 Responder End-to-End Credit Initialization Process on
page 300 describe this behavior. Note that these figures do not depict all
normal state transitions for the receive and send queues. These are fully
specified in the Software Interface chapter.



[Figure 99: flowchart. Init or Reset state: initialize the Limit Sequence
Number and the Send Sequence Number to zero. RTS, RTR, or SQD state:
recover credits and update the Limit Sequence Number. RTS state: wait for
WQEs to be posted; increment LSN; send the request and increment SSN; or
follow the "Limited WQE" protocol.]

Notes to Figure 99:
1) Any Ack or unsolicited Ack with a valid MSN
2) I.e. is the message a "Send" or "RDMA Write with immediate"?
3) LSN = AETH.MSN+AETH.credit

Figure 99 Requester End-to-End Credit Processes

[Figure 100: flowchart. RESET state: initialize credits to zero. INIT
state: increment credit count for each posted Receive WQE. RTR state:
increment credit count for each posted Receive WQE; on an inbound request,
check whether credits are available; return RNR NAK* or return credits in
response**; initialization complete.]

Notes to Figure 100:
*  Return RNR NAK only if the request would consume a receive WQE. If it
   does not, process the request.
** Responder may optionally return credits while in the RTR state using
   unsolicited acks.

Figure 100 Responder End-to-End Credit Initialization Process

9.7.7.2.3 Responder Algorithm for Calculating Credits

C9-157: For an HCA using Reliable Connection service, if the receive
queue generates end-to-end flow control credits, it shall increment its
credit count for each WQE posted to the receive queue. It shall decrement
its credit count for each inbound request message received which
consumes a WQE. Thus, the responder does not adjust its credit count when
it receives an RDMA READ request, an RDMA WRITE request without
Immediate data or an ATOMIC Operation request.

o9-96: If a TCA implements Reliable Connection service, and if the
receive queue generates end-to-end flow control credits, it shall increment
its credit count for each WQE posted to the receive queue. It shall
decrement its credit count for each inbound request message received which
consumes a WQE. Thus, the responder does not adjust its credit count
when it receives an RDMA READ request, an RDMA WRITE request
without Immediate data or an ATOMIC Operation request.
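The credit accounting rule above can be sketched as a small counter. This is only an illustration: the class name, opcode strings and return values are hypothetical, and the RNR NAK branch follows the note in Figure 100 rather than this paragraph.

```python
class ResponderCredits:
    """Hypothetical model of a responder's end-to-end credit count."""

    # Only these request types consume a receive WQE (and thus a credit).
    CONSUMES_RECEIVE_WQE = {"SEND", "RDMA_WRITE_WITH_IMMEDIATE"}

    def __init__(self) -> None:
        self.count = 0  # credit count starts at zero (RESET state)

    def post_receive_wqe(self) -> None:
        """One credit is added for each WQE posted to the receive queue."""
        self.count += 1

    def on_request(self, opcode: str) -> str:
        """Account for an inbound request message."""
        if opcode not in self.CONSUMES_RECEIVE_WQE:
            return "process"  # RDMA READ, RDMA WRITE without immediate, ATOMIC
        if self.count == 0:
            return "rnr_nak"  # no receive WQE available for this request
        self.count -= 1
        return "process"
```

For example, after two receive WQEs are posted, an RDMA READ leaves the count at two while a SEND reduces it to one.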

C9-158: For an HCA using Reliable Connection service, if the receive
queue generates end-to-end flow control credits, for each acknowledge
message generated, either a normal acknowledge message or an
unsolicited acknowledge message, it shall insert its current encoded credit
count as shown in Table 47 End-to-End Flow Control Credit Encoding on
page 302, in AETH Syndrome[4:0]. For example, if the receive queue has
five credits available, it shall insert the 5 bit value b00100 in the AETH. It
also includes its current MSN value.

o9-97: If a TCA implements Reliable Connection service, and if the
receive queue generates end-to-end flow control credits, for each
acknowledge message generated, either a normal acknowledge message or an
unsolicited acknowledge message, it shall insert its current encoded
credit count as shown in Table 47 End-to-End Flow Control Credit
Encoding on page 302, in AETH Syndrome[4:0]. For example, if the receive
queue has five credits available, it shall insert the 5 bit value b00100 in the
AETH. It also includes its current MSN value.

The presence or absence of credits limits the sender's ability to transmit
requests which will consume a receive WQE (SEND requests or RDMA
WRITE requests with immediate data).

C9-159: The send queue's behavior when it has no credits available to it
shall be as specified in Section 9.7.7.2.5 Requester Behavior - Limited
Send WQEs on page 304.

The requester may always send a request which does not consume a
receive WQE (RDMA WRITE request without immediate data, RDMA
READ request, or ATOMIC Operation request) without regard to credits.

C9-160: The requester shall not violate the normal transaction ordering
rules as stated throughout this specification, particularly in Section 9.5
Transaction Ordering on page 227.

In particular, the requester may not search the send queue looking for re- 
quests which don't consume a receive WQE and transmit those requests 
out of order, nor may the requester violate the rules governing fenced 
WQEs. 

The available credits are encoded and carried in AETH Syndrome[4:0]; 
the MSN is carried in the least significant 3 bytes of the AETH. Table 47 
below shows, for each valid encoded credit, the actual number of credits. 

Table 47 End-to-End Flow Control Credit Encoding

Credit    Value added to      Credit    Value added to
          MSN to get LSN                MSN to get LSN
00000     0                   10000     256
00001     1                   10001     384
00010     2                   10010     512
00011     3                   10011     768
00100     4                   10100     1024
00101     6                   10101     1536
00110     8                   10110     2048
00111     12                  10111     3072
01000     16                  11000     4096
01001     24                  11001     6144
01010     32                  11010     8192
01011     48                  11011     12288
01100     64                  11100     16384
01101     96                  11101     24576
01110     128                 11110     32768
01111     192                 11111     Invalid
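Table 47 can be expressed as a lookup list indexed by the 5-bit code. The sketch below is illustrative only: the function names are hypothetical, and the encoder assumes the responder rounds down to the largest encodable value not exceeding its credit count, which is consistent with the example of five credits being sent as b00100 (the code for 4).

```python
# Values from Table 47, indexed by the 5-bit credit code 0b00000..0b11110.
CREDIT_VALUES = [
    0, 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192,
    256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144,
    8192, 12288, 16384, 24576, 32768,
]
INVALID_CREDIT_CODE = 0b11111  # signals that the credit field is invalid


def decode_credits(code: int):
    """Return the credit value for a 5-bit code, or None if the field is invalid."""
    if not 0 <= code <= 0b11111:
        raise ValueError("credit code must fit in 5 bits")
    if code == INVALID_CREDIT_CODE:
        return None
    return CREDIT_VALUES[code]


def encode_credits(available: int) -> int:
    """Return the code of the largest table value <= available (assumed round-down)."""
    if available < 0:
        raise ValueError("credit count cannot be negative")
    code = 0
    for index, value in enumerate(CREDIT_VALUES):
        if value <= available:
            code = index
    return code
```

Under that rounding assumption, `encode_credits(5)` yields `0b00100`, and `decode_credits(0b00101)` yields 6.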



Logically, the requester associates a sequential Send Sequence Number 
(SSN) with each WQE posted to the send queue. The SSN bears a one- 
to-one relationship to the MSN returned by the responder in each re- 
sponse packet. Thus, the requester interprets the MSN as representing 
the SSN of the most recent request completed by the responder. 

C9-161: The encoded credit count returned by the responder in the AETH
shall specify the number of receive WQEs posted to the responder's
receive queue relative to the MSN.

[Figure 101 region labels: completed WQEs, unlimited WQEs, limited WQEs]

Since the MSN is directly related to the requester's SSN, the credit count 
is a simple offset into the send queue from the SSN of the most recent re- 
quest completed by the responder. Logically, the sum of the MSN plus the 
credit count is the requester's Limit Sequence Number (LSN). The re- 
quester may freely transmit any request whose SSN is less than or equal 
to the computed LSN. 

Any request whose SSN is greater than the current computed LSN is said 
to be limited. The send queue's behavior when it encounters a limited re- 
quest is as specified in Section 9.7.7.2.5 Requester Behavior - Limited 
Send WQEs on page 304 . 
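The LSN relationship can be written out directly. A minimal sketch, assuming no 24-bit wraparound of the sequence numbers (wrap handling is omitted); the function names and the sample MSN and credit values are hypothetical:

```python
def compute_lsn(msn: int, credits: int) -> int:
    """Limit Sequence Number: the MSN plus the decoded credit count."""
    return msn + credits


def is_limited(ssn: int, lsn: int) -> bool:
    """A request is limited when its SSN is greater than the computed LSN."""
    return ssn > lsn


# Hypothetical values: MSN 0x18 with 6 credits gives an LSN of 0x1E.
lsn = compute_lsn(0x18, 6)
```

With these sample values a request with SSN 0x1E may be transmitted freely, while one with SSN 0x1F is limited.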

Figure 101 Relating AETH values to the Send Queue on page 303 illus- 
trates the relationship between the values returned by the responder in 
the AETH and the requester's send queue. 

Figure 101 Relating AETH values to the Send Queue

[Figure: the requester's send queue, one entry per request with its SSN.]

Request        SSN
request 22     00 00 15
request 23     00 00 16
request 24     00 00 17
request 25     00 00 18
request 26     00 00 19
request 27     00 00 1A
request 28     00 00 1B
request 29     00 00 1C
request 30     00 00 1D
request 31     00 00 1E
request 32     00 00 1F
...
request n      XX XX XX

Request 32 is "limited" since its SSN is greater than the current computed
LSN.



The requester calculates a new LSN each time it receives an acknowledge
packet containing valid credits. The requester also dynamically adjusts
the LSN by adding one to it for every request it wishes to send that
does not consume a receive WQE (RDMA READ requests, RDMA
WRITE requests without immediate data, or ATOMIC Operation
requests). This adjustment is the mechanism which allows the requester to
send requests that do not consume a receive WQE.

Any given implementation is not required to implement the LSN and SSN 
mechanisms described above, but must conform semantically to the be- 
havior described. 

9.7.7.2.5 Requester Behavior - Limited Send WQEs 

C9-162: When the requester encounters a WQE on its send queue for 
which it has no available credits, that WQE is said to be limited. The send 
queue's behavior when it encounters a limited WQE shall be as follows: 

• If the limited request WQE is an RDMA READ request, an RDMA
WRITE request without immediate data, or an ATOMIC Operation
request, it may be sent normally without regard to the availability of
credits. The normal rules for ordering of requests still hold (i.e., the
send queue may not search through the list of posted WQEs in an
attempt to find unlimited WQEs to be sent out of order). After sending
such a request, the requester increments its computed LSN value
(see footnote 1), since the sent request does not consume a receive
WQE and thus does not consume a credit.

• If the limited request WQE is a SEND request, the send queue shall
transmit no more than a single packet of the request message before
it must stop transmission and wait for an acknowledge packet. To
ensure that the responder will generate a response, the requester shall
set the AckReq bit in that single packet.

• If the limited request WQE is an RDMA WRITE request with immediate
data, the requester may transmit the entire request message
before it must stop transmission and wait for an acknowledge packet.
This is permitted because it is the single packet containing immediate
data of the request that actually consumes the receive WQE. To
ensure that the responder will generate a response, the requester shall
set the AckReq bit in the last packet of the request message.

C9-163: For an HCA using Reliable Connection service, if the limited 
WQE is a SEND request, the send queue shall transmit no more than a 
single packet of the request message. Within this single packet, the Ac- 
knowledge Request (AckReq) bit of the BTH shall be set. The requester 
shall then stop transmission and wait for an acknowledge packet. 
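The three cases for a limited WQE can be summarized as a decision function. This is a sketch only; the opcode strings, the `LimitedAction` record and its field names are hypothetical and merely restate the rules listed under C9-162.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LimitedAction:
    packets_allowed: str            # "all" or "one" packet(s) before stopping
    ackreq_packet: Optional[str]    # packet carrying AckReq: "single", "last", or None
    increments_lsn: bool            # True if the requester bumps its computed LSN
    wait_for_ack: bool              # True if transmission stops until an acknowledge


def limited_wqe_action(opcode: str) -> LimitedAction:
    """Return what the send queue may do with a limited WQE of the given type."""
    if opcode in ("RDMA_READ", "RDMA_WRITE", "ATOMIC"):
        # Does not consume a receive WQE: sent normally, LSN is incremented.
        return LimitedAction("all", None, True, False)
    if opcode == "SEND":
        # At most one packet, with AckReq set in that packet.
        return LimitedAction("one", "single", False, True)
    if opcode == "RDMA_WRITE_WITH_IMMEDIATE":
        # Whole message may go out; AckReq is set in the last packet.
        return LimitedAction("all", "last", False, True)
    raise ValueError(f"unknown opcode: {opcode}")
```

Keeping the decision in one table-like function makes it easy to see that ordering is never relaxed: the function is only ever consulted for the WQE at the head of the send queue.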



1. An interesting situation can occur that artificially limits the sender's LSN
with certain message patterns: if the sender does Send, RDMA, RDMA, RDMA with
two credits from the receiver, it will increment the LSN by three. If after that, the
response arrives with MSN+1 credit, the LSN will then be set back by two,
putting the requester into limit until the Acks for the RDMAs arrive.

o9-98: If a TCA implements Reliable Connection service, and if the limited
WQE is a SEND request, the send queue shall transmit no more than a
single packet of the request message. Within this single packet, the
Acknowledge Request (AckReq) bit of the BTH shall be set. The requester
shall then stop transmission and wait for an acknowledge packet.

o9-99: In an HCA using Reliable Connection service, or if a TCA
implements Reliable Connection service, and if the limited request WQE is an
RDMA WRITE request, the requester may transmit the entire request
message before it must stop transmission and wait for an acknowledge
packet. To ensure that the responder will generate a response, the
requester shall set the Acknowledge Request (AckReq) bit in the last packet
of the RDMA WRITE request.

C9-164: Since the responder's receive queue may generate an
unsolicited acknowledge message at any time, the requester shall be prepared
to receive an unsolicited acknowledge message from the responder at
any time, provided that the receive queue has signalled that it will
generate end-to-end flow control credits.

An unsolicited acknowledge is used solely for the purpose of transferring
credits from the responder to the requester. On receiving an unsolicited
acknowledge, the requester recalculates its LSN as specified above and
responds accordingly.

A lack of credits does not impact a requester's ability to re-transmit
previously transmitted requests as part of its recovery from lost packets.
End-to-end credits only limit the transmission of new request messages. For
example, if the requester detects a timeout condition after having sent a
single packet of a limited SEND request, it decrements its timeout retry
counter as usual and retransmits the request.

9.7.8 Reliable Datagram


Reliable Datagram provides reliable communication, i.e. the same level of 
reliability and error recovery as for Reliable Connection, using a one-to- 
many paradigm. A requestor's send queue may send sequential mes- 
sages to different responders, at different QPs on the same or different 
nodes. A responder QP may receive messages from multiple requesters 
on the same or different endnodes. As with the Unreliable Datagram 
transport service, the source endnode and source QP are provided to the 
responder. 

The motivation for using Reliable Datagram is to economize the QP name
space for applications that engage in "all to all" communication. Consider
N processor nodes, each with M processes. If all M processes wish to
communicate with all the processes on all the nodes, a Reliable
Connection service requires M^2*(N-1) QPs on each node. By comparison, the
Reliable Datagram service only requires M QPs + N "end-to-end" (EE)
connections on each node for exactly the same communications.

As with Reliable Connection, the local QPs to use for this service are
established in the RD service mode by the application prior to use. Remote
QPs are chosen in an application dependent manner.
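The resource comparison is simple arithmetic and can be checked with a short sketch. The function names and the sample values of M and N are hypothetical; the formulas are the ones stated in the paragraph above.

```python
def rc_qps_per_node(m: int, n: int) -> int:
    """QPs needed per node with Reliable Connection: M^2 * (N - 1)."""
    return m * m * (n - 1)


def rd_resources_per_node(m: int, n: int) -> dict:
    """Resources needed per node with Reliable Datagram: M QPs plus N EE connections."""
    return {"qps": m, "ee_connections": n}


# Hypothetical cluster: 8 nodes with 4 processes each.
rc = rc_qps_per_node(4, 8)
rd = rd_resources_per_node(4, 8)
```

For this sample cluster, Reliable Connection needs 112 QPs per node, whereas Reliable Datagram needs 4 QPs and 8 EE connections.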

Reliability is implemented using at least one "QP-like" context for each
remote endnode - this is referred to as the End-to-End Context (EE Context
or EEC). This context provides the information needed to locate the
remote node, to serialize and exchange acknowledgments, and maintain
reliability.

The Service still uses the QPs to provide the queues, WQE pointers,
protection checking parameters etc. Together, a QP and an EEC contain the
information needed to reliably move messages to a destination. But many
QPs may use a single EEC for sending or receiving, and a QP may
communicate through several different EECs, one chosen with each
message.

When an application determines the target that it is to communicate with,
it must first establish (or use an already established) EE Context.

Once the EE Context is created, the client may send a message to the
responder QP via this EE Context. The client must specify the EE context
handle (handle that defines the destination endnode), the responder QP,
the Q_Key, and any additional message parameters. The implementation
then "multiplexes" the messages from each source QP to the appropriate
EEC, and sends the message. When the message arrives at the
destination, the implementation uses the EE Context to validate the packet and
"de-multiplexes" the message to the appropriate QP.

The Reliable Datagram service uses the methods (PSNs, ACK/NAK
protocol etc.) as described previously in 9.7 Reliable Service on page 238.

9.7.8.1 Reliable Datagram Characteristics



Table 48 Reliable Datagram QP characteristics

Property                                              Support level
----------------------------------------------------  ------------------------------
Corrupt data detected                                 Yes
Data delivered exactly once (Except for an            Yes
  unrecoverable error; that is reported to the
  application)
Data order guaranteed to same destination QP from     Yes
  the same source QP
Data order guaranteed to different destinations from  Yes
Scalability (number of messages) on the service       Limited to number of EE
                                                      Contexts in use between
                                                      endnodes
Data loss detected                                    Yes
RDMA READ Support                                     Yes
RDMA WRITE Support                                    Yes
State of in-flight SEND/RDMA WRITE when               First one unknown, others not
  unrecoverable error occurs                          delivered
ATOMIC Support                                        Optional
Multi-packet message support                          Yes
Multiple EE Context allowed between end-nodes (to     Yes
  provide traffic segregation for QOS)
Single SL / QoS assigned to EE Context                Yes
Number of messages in flight per EEC                  1
Number of messages in flight per QP                   1
Number of messages enqueued per EEC / QP              Implementation limited only
Number of packets allowed in flight (architectural)   2^23
RD QP shall only communicate with RD QPs              Yes
Partition Key verification                            On a per EEC basis
Protection verification (e.g. Q_Key, R_Key, etc.)     On a per QP basis
  and Addressing
Max Size of messages                                  2^31
Destination QP, Q_Key, and address supplied           On a per send WR basis
Source QP and address supplied                        On a per receiver completion
                                                      basis



o9-100: CAs claiming to support RD mode shall provide QPs capable of
supporting RD. When operating in RD mode, these QPs allow sending
sequential RD messages to different responders, at different destination
QPs on the same or different nodes. When operating in RD mode, these
QPs shall be capable of receiving RD messages from multiple requesters
on the same or different endnodes.

o9-101: CAs claiming to support RD mode shall provide EECs that allow
the "multiplexing" of multi packet RD message traffic to and from multiple
QPs while maintaining reliability (messages are delivered from a
requester to a responder at most once, in order and without corruption, or
the upper layer is notified.)

o9-102: CAs claiming to support RD mode shall ensure that an RD
message has been completed at the sender (fully acknowledged or completed
in error) before sending another message on the same EEC.

o9-103: CAs claiming to support RD mode shall ensure that an RD
message has been completed at the sender (fully acknowledged or completed
in error) before sending another message on the same QP.

o9-104: CAs claiming to support RD mode shall meet the requirements
specified in 9.2.1 Operation Code (OpCode) on page 199 for coding of the
RD OpCodes, 9.3.1 Reliable Datagram Extended Transport Header
(RDETH) - 4 Bytes on page 203 for creation of that header, and 9.6 Packet
Transport Header Validation on page 228 and 9.7 Reliable Service on
page 238 through 9.7.6 for reliable transports for processing RD
messages.

o9-105: CAs claiming to support RD mode shall meet the requirements
specified in 9.9 Error detection and handling on page 336 while
processing RD messages.

o9-106: CAs claiming to support RD mode shall provide support for
setting up connections between EECs as defined in Chapter 12:
Communication Management on page 516, using the Management facilities as
defined in Chapter 13: Management Model on page 564.

o9-107: CAs claiming to support RD mode shall ensure that RD message
errors or events that are not associated with the underlying EE Context
(for example Q_Key or R_Key violations or RNR-NAK) shall not cause
that EE Context to shut down or prevent the EE Context from processing
other RD messages destined to other QPs.

o9-108: HCAs claiming to support RD mode shall provide support for
Send, RDMA WRITE, RDMA READ, and ATOMICS in RD mode to the
extent defined and reported in 11.2 Transport Resource Management on
page 449.

o9-109: HCAs claiming to support RD mode shall provide support for
EEC management as defined in 10.2.6 End-to-End Contexts on page 378
and 11.2.6 EE Context on page 474.

o9-110: HCAs claiming to support RD mode shall provide support for
RDD domains as defined in 10.2.7 Reliable Datagram Domains on page
379, 11.2.1.7 Allocate Reliable Datagram Domain on page 455, and
11.2.1.8 Deallocate Reliable Datagram Domain on page 455.

Implementation note: For many implementations, an EEC will actually
be a special "mode" of a general QP or EE context. For these
implementations, the context number specified as a destination EEC must be set up
in Reliable Datagram "EEC" mode. Reliable Datagram packets arriving at
a context (identified by the EE Context field in the header) that is not set
up to "EE Context" mode shall be silently dropped.

The responder QP context must be set to support Reliable Datagram
transport service. If a Reliable Datagram packet arrives at a QP context
that is not configured for RD operation, the responder shall respond with
a "NAK Invalid RD Request".

An important distinction for this service is that errors that are not
associated with the underlying EE Context do not result in shutting that EE
Context down. Examples of these would be Q_Key or R_Key violations.
Similarly, the Receiver Not Ready (RNR NAK), caused by resources
associated with the receiver's QP, does not prevent the EE Context from
processing other messages destined to other QPs.

Errors that are associated with the EE Context (retry limit exceeded, etc.),
detected during a message transmission or reception, shall be reported in
the WR completion.

Errors associated with the requestor or responder QP shall be reported in
the WR completion with the usual error semantics. See 9.9 Error detection
and handling on page 336 for a more complete discussion on errors.

Since an end-to-end credit mechanism is not practical in a
"connectionless" type of service, responders shall send a NAK Receiver Not Ready
response if a requester's SEND arrives while the responder's Receive
Queue is empty. See 9.7.5.2.8 RNR NAK on page 281 for additional
details.

To preserve the ordering rules required of this service, and to keep the
design complexity down, messages on this service are sent one at a time
from the source QP with the requirement that each message
acknowledgment be received at the requesting QP before the next message can be
started.

[Figure: two views of "connectionless" Reliable Datagram service across
Processor 1, Processor 2 and Processor 3.]

Two views of "connectionless" Reliable Datagram service. The figure to
the right shows a software view of Reliable Datagram communication
among 4 processes on 3 processors. In this example, there is no
communication among process E and processes C and D; otherwise,
Process A can send to and receive from all the other processes.

The lower part of the figure shows the multiple EE Contexts used by the
CA to synthesize the Reliable Datagram service. Each context shows some
of the state it uses to "connect" to the others.

The SendQ 4 state shows the destinations of three messages; the ReceiveQ
states show the Queues after successful transmission of those messages.

State shown in the figure:

Process A QP=4, HCA (DLID = 33)
    SendQ 4 State:
        Msg 0: DLID=27, QP=24
        Msg 1: DLID=27, QP=25
        Msg 2: DLID=54, QP=14
    EE27 State: DLID = 27, XMit PSN = 77, Rcv ePSN = 66
    EE54 State: DLID = 54, XMit PSN = 55, Rcv ePSN = 44

Process E QP=14, HCA (DLID = 54)
    EE33 State: DLID = 33, XMit PSN = 44, Rcv ePSN = 55
    RcvQ 14 State:
        Msg 0: SLID=33, QP=4
        Msg 1: not filled
        Msg 2: not filled

9.7.8.2 Example RD Operations

The following is not normative material, but is included to clarify this topic.
These examples are based on an HCA implementation; other
implementations are possible. TCAs, for instance, may not utilize virtual memory and
may modify other details of this example.

It also maintains a linked list of Send QPs, anchored at each EEC. This
list contains QPs, each of which has a WR at the head of its Send Queue
that is destined for the EEC.

In order to manage the orderly transmission of packets and messages, the
implementation uses a scheduler. This scheduler maintains a list of those
EECs that have packets to send. As each EEC gets to the head of the
scheduler list, one or more packets are sent (depending on QoS and other
factors not important here).

On the responder side, the implementation example maintains a standard
set of Receive Queues.

The implementation also maintains space in each EEC to copy those
parameters needed to process a single incoming message. These
parameters are copied both from the QP (PD, CQ, Q_Key etc.) and the receive
WQE (data segment L_Key, Virtual address, size etc.). The EEC then has
enough information to process the entire message to completion with no
further reference to the WQE or QP, even if the message contains many
packets.



9.7.8.2.1 Example Outbound Request

1) The client of the Reliable Datagram posts a send message
   (described by WQE) to the send queue of its QP. This consists of:

   • the list of data segments (virtual address, L_Key and length) that
     describes the send message
   • the destination "EE Context number"
   • the destination QP number
   • the destination Q_Key

2) When the WQE reaches the head of the Send Queue (found by
   pointer from the QP context), the EE Context is located from the
   WQE and the QP is "linked" to the EEC's "QP list" for processing
   (EEC contains enqueue and dequeue pointers, each QP contains link
   pointer to next QP to run. Take QP at enqueue pointer, update its
   "next" link to point to the new QP, adjust enqueue pointer to the newly
   linked QP).

   If the EEC is not currently sending messages, the EEC is also placed
   into the scheduler.

3) When the EEC is scheduled to send a message, the HCA locates the
   WQE parameters by accessing the QP at the head of the EEC's "QP
   list" (Take QP found at the Dequeue pointer) and using the QP's work
   queue pointers.

4) HW uses the memory protection parameters of the enqueuing
   process (stored with the QP Context) and the virtual address etc.
   from the WQE. This allows the Send Queue HW to directly access
   the virtual address space of each process that posts send message
   buffers.

5) The HCA hardware reads the data buffer, builds the transport header
   (including the "Packet Sequence" number associated with the EE
   Context) and puts the packet onto the wire.

6) This process is repeated from step 3 until the entire message is sent.
   The "EE Context" is serviced according to the same scheduling
   algorithm used for Reliable Connection QPs.

7) When a message is completely sent, the CA waits until all
   Acknowledgments are in for the message.

8) Since the EEC must wait for a message ACK before continuing (only
   a single message outstanding at once), the EEC is scheduled with an
   appropriate timeout and the EE Context is updated.

9) When the last ACK has arrived and the WQE completed, the HCA
   determines if there are additional WQEs posted to the current QP
   (the one at the head of the EEC's QP list). If so, the next WQE is
   examined to locate the EEC for the QP's next message (this may be to
   a different EEC than the current). The CA then dequeues the QP
   from the current EEC's QP list, and enqueues it on the tail of the next
   message's EEC QP list. This is similar to step 2 above.

10) The EEC's QP list is examined to determine if any QP has work for
    this EEC.

11) The process repeats from step 3 until no more messages are
    available to send. At this point, the EEC is removed from the
    scheduler and set into an "inactive" state.






9.7.8.2.2 Example Inbound Request 

The inbound request needs to access the QP state associated with the
responder's Receive Queue, the receive WQE, and the EE Context that
maintains information about the source. Both QP and EEC are available
in the header for this purpose.

The following lists the steps taken by the HCA to process an incoming
request packet:

First or Only Packets

1) The incoming request packet arrives and is found to be un-corrupted
   and the first or only packet of the message.

2) The packet header specifies the destination QP number. This is the
   QP associated with the client of the Reliable Datagram service. This
   QP points to the receive queue, and a WQE, but does not have any
   sequence number information. The packet header also includes the
   "EE Context number" that is used to access the EE Context. The
   sequence number information is stored with the EE Context connected
   to the requesting host.

3) The incoming request's sequence number is compared against the
   state of the EE Context connected to the requesting node.

4) If the sequence number and other packet contents are correct, the
   destination QP's memory protection and WQE entry information are
   temporarily copied to the EE Context. This implementation is useful
   because other EECs may be targeting the same QP and other
   messages will end up in progress to the same QP. By copying the WQE
   and memory related information to the EEC, the QP is free to point to
   subsequent WQEs for additional messages. This is also the reason
   that receive WQEs may complete out of order.

5) The memory protection checks are done, and if the receive buffer is
   valid, the incoming request is written to memory (or in the case of an
   RDMA READ, stored for later processing).

6) The CA puts the EEC on the scheduler to send an ACK response.

7) If the packet was an "only", then the CA completes the message
   using the EEC's copy of WQE and QP values, with no additional
   references to the QP or WQE.

Middle or Last Packets

1) For subsequent packets from the same message, only the EE
   Context is accessed based on the header EEC number. This allows other
   EECs to utilize other WQEs from the QP Receive Queue
   independently.

2) If the sequence number and other header checks are correct, the
   memory protection checks are done, and if the receive buffer is valid,
   the incoming request data is written to memory.

3) The CA puts the EEC on the scheduler to send an ACK response.

4) If the packet was a "last", then the CA completes the message using
   the EEC's copy of WQE and QP values, with no additional references
   to the QP or WQE.



9.7.8.2.3 Example Outbound Acknowledge

When the EEC gets to the head of the scheduler queue, the CA notes that
an ACK must be sent, and sends it. If multiple packets have arrived before
the EEC gets to the head of the scheduler, this creates a coalesced ACK.

The EE Context's last valid receive sequence number is sent in the ACK
packet per the ACK/NAK rules.

If the operation was an RDMA read, then multiple response packets may
be required. In this case, the EEC is placed back on the scheduler after
each packet until the PSN of the responses reaches the expected PSN.



9.7.8.2.4 Example Inbound Acknowledge

A returning ACK response indicates a request packet was successfully
completed. When the ACK arrives, the EE Context is examined and the
returned PSN is checked. If this is the expected (next sequential) ACK,
the expected PSN is updated. If this is the last ACK of a message and all
previous packets were acknowledged, then the message can be
completed using the EEC's copy of the QP and WQE information. If the ACK
is not sequential, then the usual coalesced ACK rules apply. Since only a
single message is outstanding at one time, only a single message is ever
acknowledged at one time.

For RDMA READs, the CA uses the EEC's copied QP protection
information and WQE data segment information to store the data.

9.7.8.3 Reliable Datagram Operations

The processing is very much the same as defined for Reliable Connection
service. The significant difference is for the treatment of repeated packets
at the responder, and the rules for repeating a request at the requester.
The differences are highlighted in italics.

9.7.8.3.1 SEND and RDMA WRITE with Immediate Data Processing

SENDs and RDMA WRITEs with Immediate data are handled in the same
way as for Reliable Connection service, except that end-to-end credits are
not returned to the sending QP.

o9-111: CAs claiming support for Reliable Datagram service shall use the
NAK-RNR protocol to indicate an over-run of the Receive Queue for RD
messages.

9.7.8.3.2 RDMA READ Processing

RDMA READs are handled in the same way as for Reliable Connection
service. Incoming requests are stored at the responder's "hidden
resources", attached to the EE Context, and memory protection information
is accessed or copied from the QP Contexts. Unlike Reliable Connection
service, the number of RDMA READ request messages outstanding from
a single QP or EEC shall be limited to one.

9.7.8.3.3 Atomics Processing

Atomics are handled in the same way as for Reliable Connection service.
Incoming requests are stored at the responder's "hidden resources",
attached to the EE Context, and memory protection information is accessed
or copied from the QP Contexts. Unlike Reliable Connection service, the
number of ATOMIC requests outstanding from a single QP or EEC shall
be limited to one.

9.7.8.4 Ordering Rules

Receive Queues are FIFO queues. Once enqueued, WQEs shall begin
processing in FIFO order, but may be completed out of order. The
messages from any single source QP shall always be in order.

o9-112: CAs claiming to support RD mode shall provide upper layer
support for out of order receive queue completion for RD messages.

Send queues are FIFO queues. Once enqueued, WQEs shall be
processed for sending in the order they were enqueued.

o9-113: CAs claiming to support RD mode shall ensure that WQEs on the
Send Queue in RD mode are completed in order whether they are
targeting different destination QPs on the same or a different endnode or the
same destination QP. The completions for WQEs shall always be returned
to the transport consumer in FIFO order.

This does not mean that the implementation must place the data portions
of the messages in memory in any particular order. As a result, the arrival
order is not guaranteed until the message is marked complete on at least
one side. An application shall expect that memory buffers are undefined
until the message is completed.

Note that items queued on different QPs' Send Queues on the same HCA
for the same destination endnode or even the same destination QP are
not ordered with respect to each other. For example, if WQE 'A' destined
for destination 'X' and QP "75" is posted to QP 1, and WQE 'B' destined
for destination 'X' and QP "75" is later posted to QP 2 of the same CA,
there is no guarantee that 'A' will arrive before 'B' at the destination.

o9-114: For CAs claiming to support RD mode, upper layers must tolerate
lack of ordering among RD messages from different send QPs. That is,
items queued on different QPs' Send Queues on the same HCA for the
same destination endnode or even the same destination QP are not
ordered with respect to each other.

9.8 Unreliable Service

IBA defines two types of unreliable service: Unreliable Connection
(SEND, RDMA WRITE) and Unreliable Datagram (SEND only). These
services have the following characteristics:

1) Requester receives no acknowledgment of message receipt.

2) No packet order guarantees.

3) Responder validates incoming packets as normal (validates
appropriate header fields, CRC checks). A corrupted packet may be
silently dropped, causing the message to be dropped.

4) On detecting an error in an incoming packet such as a dropped / out
of order packet, the responder does not stop, but continues to
receive incoming packets.

5) Responder considers the operation complete once it has received a
complete message in correct sequence, all data has been committed
to the local fault zone, and all appropriate validity checks (including
variant and invariant CRC checks) have been completed. For
Unreliable Connected service, the definition for a completed message is
given in section 9.8.2.2.7 on page 329.

6) Requester considers a message operation complete once the "last"
or "only" packet has been committed to the fabric.

9.8.1 Validating and Executing Requests

This section applies to both unreliable connection and unreliable
datagram services. Where there are differences between the services, those
differences are noted. The major differences between the two services
are due to the fact that Unreliable Datagram service is restricted to single
packet messages whereas Unreliable Connected service does not have
this restriction. In addition, Unreliable Datagram service is restricted to
using the Send function, further simplifying the request validation process.

The following describes the requirements placed on a responder for
validating an inbound request packet.

C9-165: The responder shall validate the various fields of the headers in
order to verify the integrity of the packet. This validation process is
specified in Section 9.6 Packet Transport Header Validation on page 228.
Packets containing invalid fields shall be silently dropped by the
responder.

C9-166: For an HCA using Unreliable Connected service, the PSN shall
be examined to detect out of order packets. By examining the PSN, the
responder can determine whether the packet is a new request or an
invalid packet. See Section 9.8.2.2.1 Responder - Validating the PSN on
page 322 for a description of this check.

o9-115: If a TCA implements Unreliable Connected service, the PSN shall
be examined to detect out of order packets. By examining the PSN, the
responder can determine whether the packet is a new request or an
invalid packet. See Section 9.8.2.2.1 Responder - Validating the PSN on
page 322 for a description of this check.

C9-167: For an HCA using Unreliable Connected service, the responder
shall examine the packet OpCode to determine that the packet OpCode
sequence is valid. This check is not applicable to Unreliable Datagram
since that service is restricted to single packet messages, thus the
concept of a sequence of opcodes is not applicable.

o9-116: If a TCA implements Unreliable Connected service, the
responder shall examine the packet OpCode to determine that the packet
OpCode sequence is valid. This check is not applicable to Unreliable
Datagram since that service is restricted to single packet messages, thus the
concept of a sequence of opcodes is not applicable.

C9-168: The responder shall examine the packet OpCode to determine
whether the requested operation is supported by this receive queue.

C9-169: The responder shall verify that it has sufficient resources
available to receive the message.

C9-170: For an HCA responder using Unreliable Connection service, if
the request is for an RDMA WRITE operation, the responder shall
examine the R_Key. If the packet is found to be valid, in order, and sufficient
resources are available, it is executed by the responder. In the process of
execution, the responder may encounter local errors.

o9-117: If a TCA responder implements Unreliable Connection service,
and if it supports RDMA operations, it shall behave as follows. If an
inbound request is for an RDMA WRITE operation, the responder shall
examine the R_Key. If the packet is found to be valid, in order, and sufficient
resources are available, it is executed by the responder. In the process of
execution, the responder may encounter local errors.

C9-171: For an HCA responder using Unreliable Connection or Unreliable
Datagram services, or for a TCA responder using Unreliable Datagram
service, the responder shall follow the sequence shown in Figure 103
when validating an inbound request packet.

o9-118: If a TCA responder implements Unreliable Connection service,
the responder shall follow the sequence shown in Figure 103 when
validating an inbound request packet.

For Unreliable Connected service, these requirements are discussed in
some detail in section 9.8.2.2 Responder Behavior on page 322. Packet
validation for Unreliable Datagram service is discussed in 9.8.3.2
Responder Behavior on page 332.

Figure 103 Unreliable Service: Inbound Packet Validation

9.8.2 Unreliable Connections



An unreliable connection consists of a one-to-one correspondence
between two QPs. Packets are sent from one QP to the other but no
acknowledgments are generated by the destination QP. The chief
characteristic is that there are no delivery guarantees made to the
requester. The responder, however, can detect data corruption and out of
order packets.

The characteristics of Unreliable Connection service are summarized in
Table 49.

Table 49 Summary of Unreliable Connection Service Characteristics

Characteristic        Comment
--------------------  ----------------------------------------------------
Delivery guarantee    No guarantees to the requester. Responder may drop
                      messages.
Ordering-requester    No guarantee. Requester cannot rely on msgs arriving
                      in order.
Ordering-responder    Responder detects and drops out of order packets.
Ordering-responder    Dropped packets may cause the message to be dropped.
Ordering-responder    After dropping a packet, responder resumes with the
                      first packet of a new message.
Supported Operations  Sends and RDMA WRITES (with and without Immediate
                      data)
Message size          Maximum 2^31 bytes. Msgs may comprise multiple
                      packets.

9.8.2.1 Requester Behavior 



This section specifies the requester's required behavior when generating
request packets for Unreliable Connection service.

9.8.2.1.1 Requester - Generating PSN

C9-172: For an HCA requester using Unreliable Connection service, the
requester must place a value, called the current PSN, in the BTH:PSN
field of every request packet.

o9-119: If a TCA requester implements Unreliable Connection service,
the requester must place a value, called the current PSN, in the BTH:PSN
field of every request packet.

During connection establishment, the transport layer's client must
program the next PSN to any value between zero and 16,777,215.

C9-173: For an HCA requester using Unreliable Connection service, the
initial PSN, as programmed by the transport layer's client, shall appear as
the BTH:PSN in the first request packet generated by the requester.

o9-120: If a TCA implements Unreliable Connection service, the initial
PSN, as programmed by the transport layer's client, shall appear as the
BTH:PSN in the first request packet generated by the requester.

C9-174: For an HCA using Unreliable Connection service, the transport
layer shall modify (update) the PSN only when the send queue is in a
proper state to transmit request packets. For example, for an HCA, the
transport layer does not update the next PSN while the queue pair is in
the INITIALIZED state.

o9-121: If a TCA implements Unreliable Connection service, the transport
layer shall modify (update) the PSN only when the send queue is in a
proper state to transmit request packets.

C9-175: For an HCA using Unreliable Connection service, each request
packet generated by the requester must have a PSN value that is an
increment of "1" (modulo 2^24) of the PSN value of the preceding request
packet.

o9-122: If a TCA implements Unreliable Connection service, each request
packet generated by the requester must have a PSN value that is an
increment of "1" (modulo 2^24) of the PSN value of the preceding request
packet.

Table 50 Requester's Calculation of Next PSN

Current Request Packet  PSN for Next Request Packet
----------------------  -----------------------------
SEND, RDMA WRITE        current PSN + 1 (modulo 2^24)
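
As an illustration only, the next-PSN rule stated for the requester can
be sketched in a few lines of Python. The names PSN_BITS, PSN_MODULUS
and next_psn are hypothetical and do not come from the specification;
the sketch assumes nothing beyond a 24-bit PSN that wraps to zero after
16,777,215.

```python
PSN_BITS = 24                  # width of the BTH:PSN field (hypothetical constant name)
PSN_MODULUS = 1 << PSN_BITS    # 16,777,216 distinct PSN values


def next_psn(current_psn: int) -> int:
    """Return the PSN for the next request packet: current PSN + 1, modulo 2^24."""
    if not 0 <= current_psn < PSN_MODULUS:
        raise ValueError("PSN must fit in 24 bits")
    return (current_psn + 1) % PSN_MODULUS
```

For example, a requester whose current PSN is 16,777,215 would place
zero in the BTH:PSN of its next request packet.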

9.8.2.1.2 Requester - Generating Opcodes

The opcodes generated by a requester must fit into a schedule of opcodes
as shown below.

Table 51 Schedule of Valid OpCode Sequences

Previous Packet OpCode        Valid Opcodes for Current Packet
----------------------------  ------------------------------------------------
None (e.g., first packet      "First" packet
following connection          "Only" packet
establishment)
"First" packet                "Middle" packet (message is 3 or more packets)
                              "Last" packet (message is exactly 2 packets)
                              Type of operation must match the previous OpCode
"Middle" packet               "Middle" packet
                              "Last" packet
                              Type of operation must match the previous OpCode
"Last" packet                 "First" packet (1st packet of a new message)
                              "Only" packet (1st packet of a new single packet
                              msg)
"Only" packet                 "First" packet
                              "Only" packet

C9-176: For an HCA requester using Unreliable Connection service, the
requester must generate packet opcodes which fit within the schedule of
valid OpCode sequences as shown in Table 51 Schedule of Valid OpCode
Sequences on page 321. When generating a request packet, the
BTH:Opcode shall be as specified in Table 35 OpCode field on page 200.

o9-123: If a TCA requester implements Unreliable Connection service,
the requester must generate packet opcodes which fit within the schedule
of valid OpCode sequences as shown in Table 51 Schedule of Valid
OpCode Sequences on page 321. When generating a request packet, the
BTH:Opcode shall be as specified in Table 35 OpCode field on page 200.



9.8.2.1.3 Requester - Generating Payloads 



The requester shall generate payload lengths as a function of the opcode
as follows:

C9-177: For an HCA using Unreliable Connection service, if the OpCode
specifies a "first" or "middle" packet, then the packet payload length must
be a full PMTU size.

C9-178: For an HCA using Unreliable Connection service, if the OpCode
specifies an "only" packet, then the packet payload length must be between
zero and PMTU bytes in size. Thus, the only way to create a zero byte
length transfer is by use of a single packet message.

C9-179: For an HCA using Unreliable Connection service, if the OpCode
specifies a "last" packet, then the packet payload length must be between
one and PMTU bytes in size.

o9-124: If a TCA implements Unreliable Connection service, then it shall
conform to the three preceding HCA requirements for OpCode.
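
The three payload-length rules above lend themselves to a small
predicate. The following Python sketch is illustrative only: the
function name payload_length_ok and the use of the strings "first",
"middle", "last" and "only" to stand for the packet's position in the
message are assumptions made here, not identifiers from the
specification.

```python
def payload_length_ok(position: str, payload_len: int, pmtu: int) -> bool:
    """Check a UC request payload length against the packet's position in the message.

    position    -- "first", "middle", "last" or "only" (hypothetical encoding)
    payload_len -- payload length in bytes
    pmtu        -- path MTU in bytes
    """
    if position in ("first", "middle"):
        return payload_len == pmtu          # must be a full PMTU
    if position == "only":
        return 0 <= payload_len <= pmtu     # zero-length transfers are allowed here only
    if position == "last":
        return 1 <= payload_len <= pmtu     # at least one byte
    raise ValueError(f"unknown packet position: {position!r}")
```

With a PMTU of 256 bytes, for instance, a "middle" packet carrying 200
bytes fails the check, while a zero-byte "only" packet passes.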

C9-180: For an HCA requester using Unreliable Connection service, the
requester shall consider a message Send (or RDMA WRITE) complete
when either of the following conditions occurs: The requester has
committed the last byte of the VCRC field of the last packet to the wire (and
detected no local errors associated with the message transfer), or the
requester has detected a local error associated with the message transfer
that causes the requester to terminate sending the request.

o9-125: If a TCA requester implements Unreliable Connection service,
the requester shall consider a message Send (or RDMA WRITE)
complete when either of the following conditions occurs: The requester has
committed the last byte of the VCRC field of the last packet to the wire
(and detected no local errors associated with the message transfer), or
the requester has detected a local error associated with the message
transfer that causes the requester to terminate sending the request.

Note that at the time that the requester completes the send WQE, the
state of the memory at the responder is unknown. Likewise, if the
requester detects a local error while sending the request packet, the state
of the responder's memory is unknown.

9.8.2.2 Responder Behavior

This section specifies the responder's required behavior when receiving
inbound requests.

9.8.2.2.1 Responder - Validating the PSN

The responder maintains an Expected PSN value (ePSN) that it uses to
detect missing packets from a multi-packet request message and to
detect dropped messages. Since the PSN of every inbound request packet
is sequential and monotonically increasing for UC service, a break in the
PSN sequence indicates a lost or dropped request packet.

C9-181: For an HCA responder using Unreliable Connection service, the
responder shall maintain an Expected PSN value (ePSN). This is the PSN
that the responder expects to find in the BTH of the next inbound request
packet.

o9-126: If a TCA responder implements Unreliable Connection service,
the responder shall maintain an Expected PSN value (ePSN). This is the
PSN that the responder expects to find in the BTH of the next inbound
request packet.

The responder's expected PSN may be initialized at connection
establishment time by the transport's client to any value between zero and
16,777,215. However, since the responder will accept any valid packet
with an opcode of "first" or "only", and use the value of the PSN contained
in such a packet as its expected PSN, it is not required that the
responder's initial expected PSN be programmed. See Chapter 12:
Communication Management on page 516 for a full description of the
mechanism for loading the expected PSN at connection establishment
time.

The initial expected PSN can only be set by the client when the queue is
in the Initialized state. Attempts by the client to set the PSN when it is in
any other state may be ignored by the transport layer.

C9-182: For an HCA using Unreliable Connection service, the transport
layer shall modify (update) its expected PSN only when the receive queue
is in a proper state to receive inbound request packets. For example, for
an HCA, the transport layer does not modify the PSN when the queue pair
is in the Initialized state.

o9-127: If a TCA implements Unreliable Connection service, the transport
layer shall modify (update) its expected PSN only when the receive queue
is in a proper state to receive inbound request packets. For example, for
an HCA, the transport layer does not modify the PSN when the queue pair
is in the Initialized state.

C9-183: For an HCA responder using Unreliable Connection service, an
inbound request packet shall be declared out of order if its PSN does not
exactly match the responder's current ePSN.

o9-128: If a TCA responder implements Unreliable Connection service,
an inbound request packet shall be declared out of order if its PSN does
not exactly match the responder's current ePSN.

C9-184: An HCA responder using Unreliable Connection service shall
behave as follows. If, during packet validation, an inbound request packet is
discovered with an OpCode of "first" or "only", the responder shall accept
the packet and shall accept the PSN of that request message as its new
ePSN, regardless of whether the inbound packet is out of order or not.
This shall be done regardless of the previous value of ePSN.

o9-129: A TCA responder implementing Unreliable Connection service
shall behave as follows. If, during packet validation, an inbound request
packet is discovered with an OpCode of "first" or "only", the responder
shall accept the packet and shall accept the PSN of that request message
as its new ePSN, regardless of whether the inbound packet is out of order
or not. This shall be done regardless of the previous value of ePSN.
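
The out-of-order test and the "first"/"only" resynchronization rule
above can be sketched as follows. This is a minimal illustration, not
an implementation taken from the specification: the class name
UcPsnChecker and its method are hypothetical, and the sketch assumes
that after a packet is accepted the responder advances its ePSN to the
accepted PSN plus one (modulo 2^24). An out-of-order "middle" or "last"
packet is rejected and leaves the ePSN untouched, since the next
"first" or "only" packet resynchronizes it anyway.

```python
PSN_MODULUS = 1 << 24  # PSN values run from 0 to 16,777,215


class UcPsnChecker:
    """Tracks a responder's expected PSN (ePSN) for one UC receive queue (hypothetical)."""

    def __init__(self, initial_epsn: int = 0) -> None:
        self.epsn = initial_epsn % PSN_MODULUS

    def check(self, psn: int, position: str) -> tuple[bool, bool]:
        """Return (accepted, out_of_order) for an inbound request packet.

        position is "first", "middle", "last" or "only" (hypothetical encoding).
        """
        out_of_order = psn != self.epsn
        if position in ("first", "only"):
            # Accepted regardless of the previous ePSN: adopt this PSN,
            # then expect its successor (assumption stated in the lead-in).
            self.epsn = (psn + 1) % PSN_MODULUS
            return True, out_of_order
        if out_of_order:
            # Out-of-order "middle"/"last": reject, keep waiting for a new message.
            return False, True
        self.epsn = (psn + 1) % PSN_MODULUS
        return True, False
```

A caller could use the out_of_order flag on an accepted "first" or
"only" packet as the cue that one or more earlier messages were lost.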

C9-185: For an HCA responder using Unreliable Connection service,
before executing an inbound request, the responder shall check the PSN by
comparing the PSN in the inbound BTH to the responder's expected PSN.
The rules that the responder uses to calculate its next expected PSN shall
be the same as those used by the requester when it calculates the PSN
value to insert in its next request packet. These rules are given in 9.8.2.1.1
Requester - Generating PSN on page 319.

o9-130: For an HCA responder using Unreliable Connection service,
before executing an inbound request, the responder shall check the PSN by
comparing the PSN in the inbound BTH to the responder's expected PSN.
The rules that the responder uses to calculate its next expected PSN shall
be the same as those used by the requester when it calculates the PSN
value to insert in its next request packet. These rules are given in 9.8.2.1.1
Requester - Generating PSN on page 319.

o9-131: If the PSN of the inbound message does not match the
responder's ePSN, the responder may notify its client of the presence of one
or more lost messages. The mechanism by which the responder notifies
its client is outside the scope of this specification.

C9-186: For an HCA responder using Unreliable Connection service, if a
multi-packet message is in progress at the time that an out of order packet
is detected, the current message shall be silently dropped. The responder
then waits for the first packet of a new message. It is possible that the
present packet (the out of order packet) is the first packet of a new
message. If so, it shall be treated as a new message.

o9-132: If a TCA responder implements Unreliable Connection service, if
a multi-packet message is in progress at the time that an out of order
packet is detected, the current message shall be silently dropped. The
responder then waits for the first packet of a new message. It is possible that
the present packet (the out of order packet) is the first packet of a new
message. If so, it shall be treated as a new message.

A "new message" is denoted by an inbound request packet with an
OpCode in the BTH of "first" or "only".

"Current message" means all the packets received since the most
recently received "first" or "only" OpCode, excluding the present packet.

9.8.2.2.2 Responder - OpCode Sequence Check

A request packet must fit within a schedule of valid OpCode sequences.
The OpCode sequence is determined by examining the BTH:OpCode.

C9-187: For an HCA responder using Unreliable Connection service, the
responder shall check the sequence of packet OpCodes as described in
items (1) through (5) below:

1) If this is the first packet following establishment of the connection,
then the packet OpCode must indicate either "first" or "only". An
OpCode of "middle" or "last" implies that at least the first packet of the
current message was lost and denotes an invalid OpCode sequence.

2) If the last valid packet received had an OpCode indicating "first", then
the current OpCode must indicate either "middle" or "last". It must
also match the operation type specified in the last valid packet
(SEND, RDMA WRITE). A current OpCode of "first" or "only" implies
that at least the last packet of the previous message was lost and
denotes an invalid OpCode sequence.

3) If the last valid packet received had an OpCode indicating "middle",
then the current OpCode must indicate either "middle" or "last". It
must also match the operation type specified in the last valid packet
(SEND or RDMA WRITE request). A current OpCode of "first" or
"only" implies that at least the last packet of the previous message
was lost and denotes an invalid OpCode sequence.

4) If the last valid packet received had an OpCode indicating "last", then
the current OpCode must indicate either "first" or "only". A current
OpCode of "middle" or "last" implies that at least the first packet of the
current message was lost and denotes an invalid OpCode sequence.

5) If the last valid packet received had an OpCode indicating "only", then
the current OpCode must indicate either "first" or "only". A current
OpCode of either "middle" or "last" implies that the first packet of the
current message was missed and denotes an invalid OpCode
sequence.

o9-133: If a TCA responder implements Unreliable Connection service,
the responder shall check the sequence of packet OpCodes as described
in items (1) through (5) above.

The responder's behavior in the presence of an invalid OpCode sequence
is specified in Section 9.9.3 Responder Side Behavior on page 349.

C9-188: For an HCA responder using Unreliable Connection service, if
the responder detects an invalid OpCode sequence, the current message
shall be silently dropped. The responder then waits for a new inbound
request packet with an OpCode of "first" or "only"; any other inbound request
packet shall be silently dropped.

o9-134: If a TCA responder implements Unreliable Connection service,
and if the responder detects an invalid OpCode sequence, the current
message shall be silently dropped. The responder then waits for a new
inbound request packet with an OpCode of "first" or "only"; any other
inbound request packet shall be silently dropped.

"Current message" means all the packets received since the most
recently received "first" or "only" OpCode, excluding the present packet.

C9-189: For an HCA responder using Unreliable Connection service, if 
the present packet, which caused the invalid OpCode sequence, has an 
OpCode of "first" or "only" it shall be treated as the first packet of a new 
request message. 

o9-135: If a TCA responder implements Unreliable Connection service,
and if the present packet, which caused the invalid OpCode sequence,
has an OpCode of "first" or "only" it shall be treated as the first packet of
a new request message.
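
The OpCode sequence rules and the drop-and-resume behaviour described
above can be pictured as a small transition table. The Python sketch
below is an illustration under stated assumptions: the names VALID_NEXT
and UcOpcodeTracker, the string encodings of packet position and
operation type, and the three result strings are all hypothetical and
are not defined by the specification. PSN checking is deliberately left
out so that only the OpCode sequence logic is shown.

```python
# Allowed packet positions, keyed by the position of the last valid packet.
# None means "no valid packet yet" (e.g. just after connection establishment).
VALID_NEXT = {
    None: {"first", "only"},
    "first": {"middle", "last"},
    "middle": {"middle", "last"},
    "last": {"first", "only"},
    "only": {"first", "only"},
}


class UcOpcodeTracker:
    """Tracks the OpCode sequence seen on one UC receive queue (hypothetical)."""

    def __init__(self) -> None:
        self.prev = None  # (position, operation) of the last valid packet, or None

    def receive(self, position: str, operation: str) -> str:
        """Classify an inbound packet; operation is e.g. "SEND" or "RDMA_WRITE"."""
        prev_pos, prev_op = self.prev if self.prev else (None, None)
        in_schedule = position in VALID_NEXT[prev_pos]
        # "middle"/"last" must continue the same type of operation.
        same_op = position in ("first", "only") or operation == prev_op
        if in_schedule and same_op:
            self.prev = (position, operation)
            return "accept"
        if position in ("first", "only"):
            # Invalid sequence, but the present packet starts a new message.
            self.prev = (position, operation)
            return "drop-current-accept-new"
        # Invalid sequence: drop, then wait for a "first" or "only" packet.
        self.prev = None
        return "drop"
```

Resetting the tracked state to None after a drop models "waiting for a
new message": only a "first" or "only" packet is accepted from that
point on.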

The list of valid OpCode sequences is summarized in the following table.

Table 52 Summary: Valid OpCode Sequences

Previous Packet OpCode        Valid Opcodes for Current Packet
----------------------------  ------------------------------------------------
None (e.g., first packet      "First" packet
following connection          "Only" packet
establishment)
"First" packet                "Middle" packet (message is 3 or more packets)
                              "Last" packet (message is exactly 2 packets)
                              Type of operation must match the previous OpCode
"Middle" packet               "Middle" packet
                              "Last" packet
                              Type of operation must match the previous OpCode
"Last" packet                 "First" packet (1st packet of a new message)
                              "Only" packet (1st packet of a new single packet
                              msg)
"Only" packet                 "First" packet
                              "Only" packet


9.8.2.2.3 Responder OpCode Validation 

C9-190: For UC, the responder shall validate the requested function 
(SEND or RDMA WRITE) is supported by the receive queue and that the 
BTH:OpCode is not reserved before executing the request. 

Note that the OpCode was also examined as part of packet validation in 
section 9.6 Packet Transoort Header Validation on pace 228 to ensure 
that the inbound packet contains a request for Unreliable Connected ser- 
vice. 

C9-191: Invalid UC requests shall be silently dropped by the responder
per 9.9.3 Responder Side Behavior on page 349.

9.8.2.2.4 Responder Remote Access Validation

C9-192: For an HCA responder using Unreliable Connection service, if
the inbound request is for a RDMA WRITE and the requested DMA length
in the RETH is non-zero, then the following conditions shall be checked:

• The R_Key field in the RETH is valid,

• The virtual address and length specified in the RETH are within
the locally defined limits associated with the R_Key,

• The type of access specified (Write) is within the locally defined
limits associated with the R_Key.

A failure of any of these checks constitutes an R_Key violation. The
responder's behavior in response to an R_Key violation is specified in
Section 9.9.3 Responder Side Behavior on page 349.

o9-136: If a TCA responder implements Unreliable Connection service
and RDMA functionality, it shall conform to the preceding HCA
compliance statement.

C9-193: For an HCA using Unreliable Connection service, the R_Key field
shall not be checked for a zero-length RDMA WRITE request, even if the
request includes Immediate data.

o9-137: If a TCA responder implements Unreliable Connection service
and RDMA functionality, the R_Key field shall not be checked for a
zero-length RDMA WRITE request, even if the request includes Immediate
data.
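
The remote access checks for a UC RDMA WRITE can be sketched as follows. This is a minimal illustration under stated assumptions: the `MemoryRegion` record, the `regions` dictionary keyed by R_Key, and the function name are hypothetical and do not come from the specification, which leaves the "locally defined limits" to the implementation.

```python
from dataclasses import dataclass


@dataclass
class MemoryRegion:
    """Hypothetical record of the limits associated with one R_Key."""
    start: int           # first virtual address covered
    length: int          # number of bytes covered
    remote_write: bool   # whether remote Write access is permitted


def rdma_write_access_ok(regions, r_key, virtual_address, dma_length):
    """Return True if a UC RDMA WRITE passes the remote access checks."""
    if dma_length == 0:
        return True  # zero-length RDMA WRITE: the R_Key is not checked
    region = regions.get(r_key)
    if region is None:
        return False  # the R_Key is not valid
    # Address range must lie inside the limits associated with the R_Key.
    in_bounds = (region.start <= virtual_address
                 and virtual_address + dma_length <= region.start + region.length)
    # The access type (Write) must also be permitted for this R_Key.
    return in_bounds and region.remote_write
```

A `False` result corresponds to what the text calls an R_Key violation; how the responder reacts to it is covered by the responder side behavior section.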

9.8.2.2.5 Responder - Length Validation

C9-194: For an HCA responder using Unreliable Connection service, the
PktLen field of the LRH shall be checked to confirm that there is sufficient
space available in the receive buffer specified by the receive WQE. This
check applies only to SENDs or to RDMA WRITEs with immediate data.

o9-138: If a TCA responder implements Unreliable Connection service,
the PktLen field of the LRH shall be checked to confirm that there is
sufficient space available in the receive buffer specified by the receive WQE.
This check applies only to SENDs or to RDMA WRITEs with immediate
data.

The length of the packet shall also be validated by comparing it to the
OpCode as follows:

C9-195: For an HCA responder using Unreliable Connection service, if
the UC BTH:OpCode specifies a "first" or "middle" packet, then the packet
payload length must be a full PMTU size.

o9-139: If a TCA responder implements Unreliable Connection service,
and if the UC BTH:OpCode specifies a "first" or "middle" packet, then the
packet payload length must be a full PMTU size.

C9-196: For an HCA responder using Unreliable Connection service, if
the UC BTH:OpCode specifies an "only" packet, then the packet payload
length must be between zero and PMTU bytes in size. Thus, the only way
to create a zero byte length transfer is by use of a single packet message.

o9-140: If a TCA responder implements Unreliable Connection service,
and if the UC BTH:OpCode specifies an "only" packet, then the packet
payload length must be between zero and PMTU bytes in size. Thus, the only
way to create a zero byte length transfer is by use of a single packet
message.

C9-197: For an HCA responder using Unreliable Connection service, if
the UC BTH:OpCode specifies a "last" packet, then the packet payload
length must be between one and PMTU bytes in size.

o9-141: If a TCA responder implements Unreliable Connection service,
and if the UC BTH:OpCode specifies a "last" packet, then the packet
payload length must be between one and PMTU bytes in size.

C9-198: For an HCA responder using Unreliable Connection service, if
the request is an RDMA WRITE, the total amount of payload data
received shall be compared to the DMA Length field specified in the RETH.

o9-142: If a TCA responder implements Unreliable Connection service
and RDMA functionality, and if the request is an RDMA WRITE, the total
amount of payload data received shall be compared to the DMA Length
field specified in the RETH.

C9-199: For an HCA responder using Unreliable Connection service, if
the BTH:OpCode field[4:0] specifies a first or middle request packet (e.g.
SEND First, or RDMA WRITE Middle), the pad count bits are verified to
be b00, indicating no pad bytes are present. If the pad count bits are
non-zero, the OpCode is invalid.

o9-143: If a TCA responder implements Unreliable Connection service,
and if the BTH:OpCode field[4:0] specifies a first or middle request packet
(e.g. SEND First, or RDMA WRITE Middle), the pad count bits are verified
to be b00, indicating no pad bytes are present. If the pad count bits are
non-zero, the OpCode is invalid.

If a packet is detected with an invalid length, or the total amount of RDMA
WRITE data does not match the DMA Length field in the RETH, the
request is an invalid request. The responder's behavior in such a case is
specified in Section 9.9.3.1 Responder Side Error Response on page 351.
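
The per-OpCode length rules above can be condensed into a few predicates. This is a sketch under stated assumptions: the function names and the string labels for packet position are hypothetical, and the receive-buffer space check against PktLen is left out.

```python
def uc_payload_length_ok(position, payload_len, pmtu):
    """Check a UC packet payload length against its position in the message."""
    if position in ("first", "middle"):
        return payload_len == pmtu          # must be a full PMTU
    if position == "only":
        return 0 <= payload_len <= pmtu     # zero-length transfer allowed
    if position == "last":
        return 1 <= payload_len <= pmtu     # at least one byte
    raise ValueError(f"unknown packet position: {position!r}")


def uc_pad_count_ok(position, pad_count):
    """First and middle packets must carry no pad bytes (pad count b00)."""
    if position in ("first", "middle"):
        return pad_count == 0
    return True  # the rules above state no pad count constraint for other positions


def rdma_write_total_ok(received_payload_bytes, dma_length):
    """For an RDMA WRITE, the total payload must match the RETH DMA Length."""
    return received_payload_bytes == dma_length
```

A request that fails any of these predicates would be an invalid request in the sense of the closing paragraph above.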



InfiniBand™ Trade Association

Page 328

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067



9.8.2.2.6 Responder - Local Operation Validation

A valid inbound request may still fail to complete due to a failure that is
local to the responder, e.g. local memory translation error while accessing
local memory. A local error may cause the receive queue to transition to
the error state. See 9.9.3 Responder Side Behavior on page 349 for
additional details.

9.8.2.2.7 Completing a Message Receive

The responder considers a given inbound message completed
successfully when it has:

• Detected the beginning of a valid message as indicated by the
presence of a "First packet" or "Only packet" OpCode in the BTH,

• Detected the end of the same valid message as indicated by the
presence of a "Only packet" or "Last packet" OpCode in the BTH,
without a skip in the PSN sequence,

• Received all the packets between "First packet" and "Last packet"
inclusive successfully and in order, or has successfully received an
"Only packet".

• Committed the message payload to the local fault zone without error,
and,

• Successfully completed all appropriate validity checks (including
variant and invariant CRC).

A failure detected during any of these steps may or may not cause the
associated WQE to be completed in error. In some cases, such as a missing
"first" packet, it is entirely likely that no WQE will be consumed by the
responder. Note that, in the presence of errors, it is not possible to
guarantee the state of the responder's memory. Some or all of a given packet
may have been committed to the responder's memory before the error is
detected.

Once an inbound message receive is completed successfully, the
responder completes the current WQE.

9.8.3 Unreliable Datagrams

Unreliable Datagrams are a form of communication that allow a source
QP to send each message to one of many destination QPs that may exist
on the same or multiple destination endnodes.

• For each message to be sent, the requester must be supplied
with the destination address (see 11.2.2.1 Create Address Handle
on page 456), the destination QP, the destination Q_Key etc.
See 11.4.1.1 Post Send Request on page 496 for the parameters
supplied for an HCA.

• The responder must deliver to the client the requester's address,
QP etc. See 11.4.2.1 Poll for Completion on page 502 for more
detail on HCA requirements.

Table 53 Unreliable Datagram QP characteristics

Property / Level of Reliability | Support
Corrupt data detected and dropped | Yes, silent drop on error
OpCode Service and command Validation | Yes, silent drop on error
Receive buffer overrun | Yes, reported as WR error
Data repeated | No
Data order guaranteed | No
Data loss detected | Not required
RDMA Support | No
ATOMIC Support | No
Immediate data support | Yes
Max Size of SEND messages | PMTU-sized packet: 256 to 4096 bytes of data payload. Any message that exceeds the PMTU will not be delivered.
State of SEND when request completed | Committed to transmission on the fabric



C9-200: Devices that source UD messages shall limit the UD message
size to a single packet. The packet should be no larger than the PMTU
between the source and destination (or it will be dropped).

C9-201: Devices that source and sink UD messages shall meet the
requirements of the basic Unreliable Services (see 9.8 Unreliable Service
on page 315 through 9.8.1 Validating and Executing Requests on page
316).

C9-202: Devices that source and sink IBA UD messages shall meet the
requirements specified in 9.9 Error detection and handling on page 336.

[Figure 104: diagram not reproduced. The explanatory text and the QP
state shown in the figure follow.]

This figure shows two views of Unreliable Datagram QPs. Process A on
Processor 1 communicates with three processes: processes C and D on
Processor 2 and process E on Processor 3.

The view on the right shows how software might view the connection.
Buffers in the Send Queue flow into buffers in the Receive Queue on the
connected QP.

The lower view gives a hardware centric view, showing some of the state
maintained per connected QP. Since each QP is connected to nothing, the
destination and other necessary information (SL, DGID etc.) is picked
with each message. The PSNs, while generated, are not checked.

QP state shown in the figure:

Processor 1, Process A, QP 4 (CA DLID = 33): DLID = in WQE; Destination QP = in WQE; XMit PSN = 5; Rcv ePSN = NA
Processor 2, Process C, QP 24 (CA DLID = 27): DLID = in WQE; Destination QP = in WQE; XMit PSN = 104; Rcv ePSN = NA
Processor 2, Process D, QP 25 (CA DLID = 27): DLID = in WQE; Destination QP = in WQE; XMit PSN = 42; Rcv ePSN = NA
Processor 3, Process E, QP 14 (CA DLID = 54): DLID = in WQE; Destination QP = in WQE; XMit PSN = 72; Rcv ePSN = NA

Figure 104 Connectionless QPs for Unreliable Datagram Operation

9.8.3.1 Requester Behavior



This section specifies the requester's required behavior when generating
request packets.

C9-203: Devices that source UD messages shall meet the requirements
specified in 9.8.3.1.1 Generating PSN on page 332 and 9.8.3.1.2
Completing a Message Send on page 332 while sending UD messages.



9.8.3.1.1 Generating PSN 



C9-204: For each request message on a UD transport service, the
requester shall generate a PSN that is an increment of "1" (modulo 2^24) of
the PSN value of the preceding request packet.

The initial PSN value shall be loaded by the transport's client while the 
send queue is in the Initialized state and may be initialized to any 24-bit 
value. While in the process of transmitting request packets, the transport 
layer shall modify (update) the PSN only when the send queue is in the 
Ready to Send state. 
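
The PSN increment can be written as a one-line helper. This is a minimal sketch: the names `PSN_MODULUS` and `next_psn` are assumptions for illustration, and the 24-bit width follows from the statement that the initial PSN may be any 24-bit value.

```python
PSN_MODULUS = 1 << 24  # PSNs are 24-bit values


def next_psn(psn):
    """Return the PSN that follows `psn`, wrapping modulo 2**24."""
    if not 0 <= psn < PSN_MODULUS:
        raise ValueError("PSN must fit in 24 bits")
    return (psn + 1) % PSN_MODULUS
```

The wrap-around case is the one worth noting: the PSN after `0xFFFFFF` is `0`.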

9.8.3.1.2 Completing a Message Send 

The requester shall consider a message Send complete when it has: 

• Committed the last byte of the VCRC field of the packet to the wire,
and detected no local errors associated with the message transfer.

• Detected a local error associated with the message transfer that
causes the requester to terminate sending the request.

Note that at the time that the requester completes the send WQE, the 
state of the memory at the responder is unknown. Likewise, if the re- 
quester detects a local error while sending the request packet, the state 
of the responder's memory is unknown. 

9.8.3.2 Responder Behavior 

This section specifies the responder's required behavior when receiving 
inbound requests. 

C9-205: Devices that receive UD messages shall meet the requirements
specified in 9.8.3.2.2 Responder OpCode Validation on page 333,
9.8.3.2.3 Responder - Length Validation on page 333, and 9.8.2.2.7
Completing a Message Receive on page 329 when receiving a UD message.

9.8.3.2.1 Responder - Validating the PSN 

o9-144: For UD transport service, the responder may ignore the PSN
field.

Some applications (e.g. multicast-based media streaming) may derive
benefit from having the responder validate the PSN sequence to detect
out-of-sequence packets. It is permissible for a responder implementation
to do so, but is outside the scope of the IBA specification.

9.8.3.2.2 Responder OpCode Validation

C9-206: For UD, the responder shall validate that the BTH:OpCode for the
requested function (SEND) is supported by this receive queue and is not
reserved before executing the request; otherwise the request is invalid.

C9-207: If a UD receive queue does not have an entry to hold an inbound
SEND request, the request is invalid.

If the request is invalid, it shall be silently dropped by the responder as
specified in Section 9.9.3 Responder Side Behavior on page 349.

9.8.3.2.3 Responder - Length Validation

Before executing the request, the responder shall validate the Packet
Length field of the LRH and GRH and the PadCnt of the BTH. The
following characteristics shall be validated:

• The Length fields shall be checked to confirm that there is sufficient
space available in the receive buffer specified by the receive WQE.

• The packet payload length must be between zero and PMTU bytes
inclusive in size.

If a packet is detected with an invalid length, the request shall be an invalid
request and it shall be silently dropped by the responder as specified in
Section 9.9.3 Responder Side Behavior on page 349. The responder then
waits for a new request packet.
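
The two UD length checks can be sketched as a single predicate. This is an illustration under stated assumptions: the function name and parameters are hypothetical, and the sketch compares only the payload length against the buffer, ignoring any header bytes an implementation may also place in the receive buffer.

```python
def ud_length_ok(payload_len, pmtu, receive_buffer_len):
    """Return True if a UD packet passes both length checks."""
    # Sufficient space must be available in the receive buffer.
    fits_buffer = payload_len <= receive_buffer_len
    # Payload must be between zero and PMTU bytes inclusive.
    within_pmtu = 0 <= payload_len <= pmtu
    return fits_buffer and within_pmtu
```

A packet for which this returns `False` would be silently dropped, after which the responder waits for a new request packet.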

27 

9.8.3.2.4 Responder - Local Operation Validation

A valid inbound request may still fail to complete due to a failure that is
local to the responder, e.g. local memory translation error while accessing
local memory. A local error may cause the receive queue to transition to
the error state. See 9.9.3 Responder Side Behavior on page 349 for
additional details.

9.8.3.2.5 Completing a Message Receive

The responder shall consider a given inbound message completed
successfully when it has:

• Committed the message payload to the local fault zone without error

• Successfully completed all appropriate validity checks (including
variant and invariant CRC).

A failure detected during any of these steps may or may not cause the
associated WQE to be completed in error. In some cases, such as an
opcode or length error, no WQE will be consumed by the responder. Note
that, in the presence of errors, it is not possible to guarantee the state of
the responder's memory. Some or all of a given packet may have been
committed to the responder's memory before the error is detected. Once
an inbound message receive is completed successfully, the responder
completes the current WQE.

9.8.4 Raw Datagrams



The previous several sections describe the different transport protocols 
defined by the IBA specification. In addition to these, IBA allows other pro- 
tocols to be carried by an IBA subnet. IBA datagrams that encapsulate 
such traffic are referred to as Raw Datagrams. 

IBA defines two different methods to support Raw Datagrams. In Section 
7.7.5 Link Next Header (LNH) - 2 bits on page 160 two bits in the local 
route header are used to specify the next header after the LRH. The fol- 
lowing table describes the two LNH encodings that describe Raw Data- 
grams. 



Link Next Header LNH(1:0): IBA_Transport | Link Next Header LNH(1:0): GRH (IPv6) header | Structure of the Raw Datagram
0 | 1 | LRH, IPv6, Packet Payload, VCRC
0 | 0 | LRH, RWH, Packet Payload, VCRC

Figure 105 Raw Datagrams



The first method of encoding a Raw Datagram is used only for IPv6
datagrams. The packet payload may contain any transport or network protocol
defined by the IETF's encoding of the IPv6 header's "next header" field
excluding any encoding indicating the next header is an IBA transport
header.

C9-208: CAs shall not generate an outbound packet and will discard any 
inbound packet whose LRH indicates a Raw Datagram and whose IPv6 
"next header" indicates an IBA transport. TCAs may report this error in 
any manner they choose. 

The second method of encoding a Raw Datagram uses the IBA defined
raw header (RWH). The RWH contains the 16-bit Ethertype field; the
RWH is described in section 5.3 Raw Packet Format on page 128. The
RWH is used to define the protocol header encapsulated in the packet
payload. In general, the second method is used to allow protocols not
supported by the IPv6 next header; it should be noted that either method may
be used to transport IPv6 datagrams.

o9-145: If a CA implements Raw Datagram support, the Packet Payload
of Raw Datagrams must always be of a modulo 4 size, since the LRH
Packet length describes the length in 4 byte increments. Should the
encapsulated payload size not be a multiple of 4 bytes, the payload shall be
padded to a multiple of 4 bytes.
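
The padding rule can be sketched as below. This is a minimal illustration: the function name is hypothetical, and the use of zero bytes as padding is an assumption, since the text does not state what value the pad bytes take.

```python
def pad_to_multiple_of_4(payload: bytes) -> bytes:
    """Pad `payload` so that its length is a multiple of 4 bytes."""
    pad_len = -len(payload) % 4          # 0, 1, 2 or 3 bytes of padding
    return payload + b"\x00" * pad_len   # zero fill is an assumption
```

A 5-byte payload is extended to 8 bytes, while a payload whose length is already a multiple of 4 is returned unchanged.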

o9-146: If a CA implements Raw Datagram support, the QPs used to
inject and consume Raw Datagrams shall be locally managed, i.e. the
association of a QP with a given Raw Datagram service is implementation
dependent.

o9-147: If a CA implements Raw Datagram support, it may support one
or more QPs for Raw Datagram operations.

9.8.4.1 Raw Datagram Packet Size

The IBA MTU defines the maximum size of an IBA transport's data
payload. The maximum size of an IBA packet is MTU+124 bytes [1] (see 7.7.8
Packet Length (PktLen) - 11 bits on page 160).

o9-148: If a CA implements Raw Datagram support, and since a Raw
datagram does not use IBA transport headers, raw datagrams may have a
packet payload larger than the supported MTU (see Figure 105 Raw
Datagrams on page 334). The table below summarizes the maximum packet
payload (and the corresponding value for the LRH PktLen field) for each
of the two raw datagram types.

Table 54 Maximum Raw Datagram Packet Payload

MTU | IPv6 Raw Datagram: Largest Possible Packet Payload (a) | IPv6 Raw Datagram: Corresponding PktLen Value | RWH Raw Datagram: Largest Possible Packet Payload (b) | RWH Raw Datagram: Corresponding PktLen Value
256 | 332 Bytes | 83 | 368 Bytes | 92
512 | 588 Bytes | 147 | 624 Bytes | 156
1024 | 1100 Bytes | 275 | 1136 Bytes | 284
2048 | 2124 Bytes | 531 | 2160 Bytes | 540
4096 | 4172 Bytes | 1043 | 4208 Bytes | 1052

a. largest possible IPv6 raw packet payload = MTU + 124 (the largest packet
header/CRC size) - 8 (LRH size) - 40 (IPv6 header size)

b. largest possible RWH raw packet payload = MTU + 124 (the largest packet
header/CRC size) - 8 (LRH size) - 4 (RWH header size)

1. The 124B maximum packet header/CRC byte count does not include the
VCRC field.
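
The entries of Table 54 follow directly from the formulas in notes a and b, as this sketch shows. The constant and function names are assumptions made for illustration. The `table_pktlen` helper only reproduces the PktLen values listed in the table, which equal the payload divided by 4; it is not offered as a general definition of the PktLen field.

```python
MAX_HEADER_CRC_BYTES = 124  # largest packet header/CRC size, excluding VCRC
LRH_BYTES = 8
IPV6_HEADER_BYTES = 40
RWH_BYTES = 4


def max_raw_payload(mtu, kind):
    """Largest raw datagram packet payload in bytes for kind 'ipv6' or 'rwh'."""
    next_header = {"ipv6": IPV6_HEADER_BYTES, "rwh": RWH_BYTES}[kind]
    return mtu + MAX_HEADER_CRC_BYTES - LRH_BYTES - next_header


def table_pktlen(payload_bytes):
    """PktLen value as listed in Table 54 (payload in 4-byte increments)."""
    return payload_bytes // 4
```

For an MTU of 256 this gives 332 bytes (PktLen 83) for an IPv6 raw datagram and 368 bytes (PktLen 92) for an RWH raw datagram, matching the first row of the table.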
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9.9 Error detection and handling 



[Figure: block diagram not reproduced. Labels shown: Send Queue,
Requester CQ, Requester Local Error Detection, Remote Error Detection,
Request Packets, Response (ACK) Packets, Acknowledge Generation,
Responder Local Error Detection, Receive Queue, Responder CQ.]


IBA uses a layered error management architecture (LEMA) approach. 
Each level is responsible for detecting and managing errors appropriate 
to that layer before passing the packet or message up to the next layer in 
the stack. 

Thus the transport layer responds to errors particular to the transport in- 
cluding errors in the packet header and failures to correctly transport a 
message. 

Errors detected in the transport layer are reported to the transport's client.
In this section, the interface between the transport layer and its client is
shown conceptually as the Send or Receive Queue. In the case of an
HCA, the transport indicates errors to its client by writing a completion
code to a Completion Queue Entry (CQE) on the Completion Queue (CQ).
As usual, TCAs are free to report errors (or not) as they see fit.

In order to simplify the discussion, error behavior is discussed separately 
for the requester and responder ends. This causes a slight amount of du- 
plication between the summary tables in the following sections describing 
the errors for the requester and responder side. Specifically, overlaps 
occur when an error is detected by the responder and reported to the re- 
quester. These areas of overlap, however, are strictly confined to reliable 
classes of service. 

Errors that are reported by the requester to its client fall into one of two 
classes. The first are Locally Detected errors; i.e., errors that are detected 
solely by the requester side. An example of a locally detected error is a 
protection fault detected by the requester while accessing its own local 
memory during a send request. 

The second class is remotely detected errors, which are those errors
detected by the responder and reported to the requester via a NAK
syndrome in a Response packet. Remotely detected errors only apply to the
reliable classes of service (reliable connected and reliable datagram).

Whereas there were two classes of errors for the requester side (locally 
and remotely detected), there are only locally detected errors on the re- 
sponder side. 

In response to a locally detected error, the responder side may be re- 
quired to report the error to the requester, or it may be required to report 
the error to its local client, or both, or neither. The choice of to whom the 
error is reported is governed by the class of service (reliable versus unre- 
liable), and the specific error that is detected. 



Figure 106 Requester/Responder Error Detection
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The key focus of the following sections is to categorize all errors according 
to how errors are reported to the transport layer's client, and the behavior 
that the send (receive) queue must exhibit following detection of an error. 
Thus, this section is categorized according to not only where an error is 
detected, but to whom it is reported. 



9.9.1 Reporting Errors to the Verbs Layer 



For an HCA, the IBA software interface defines three types of errors that 
can be reported through the verbs layer. These are called immediate er- 
rors, completion errors, and asynchronous errors. Of those three types, 
the transport layer is only capable of reporting completion errors or asyn- 
chronous errors. This is because immediate errors are detected by the 
verbs layer before the WQE ever gets posted to the transport layer. Table 
55 summarizes the types of errors that an IBA transport can detect and 
report to the verbs layer. For more information on these error types, see 
10.10.2 Error Handling Mechanisms on page 434 . 

Table 55 Software Error Types Detected by Transport Layer

IBA Software Defined Error Types | Detected by IBA transport
Immediate Errors | no
Completion Errors - Interface check | yes
Completion Errors - Processing error | yes
Asynchronous Errors - Affiliated type | yes
Asynchronous Errors - Unaffiliated type | yes



There are two classes of completion errors: Interface checks and pro- 
cessing errors. An interface check is an error in the information supplied 
to the Channel Interface detected before data is placed onto the link. A 
processing error is an error encountered during the processing of the work 
request by the Channel Interface. 



9.9.2 Requester Side Error Behavior 



As indicated above, the requester detects errors originating locally or re- 
motely. 



9.9.2.1 Requester Side Error Detection - Locally Detected Errors 



A locally detected error reflects either an error condition that has occurred 
in the requester's channel interface, a missing response from the re- 
sponder side (timeout) or excessive retries for sequence errors or RNR 
NAKs. 

Locally detected errors at the requester can occur during request packet
generation, during the processing of response packets, or due to a
timeout.

C9-209: For an HCA requester using RC, UC, or UD service, and for a
TCA requester using UD service, the requester shall behave as follows.
For locally detected transport errors that are detected during transmission
of request packets, a CA shall stop transmission for the affected QP, shall
store the state associated with the error until any previous incomplete
WQEs are completed, and finally complete the affected WQE. The QP
shall be put into the error state if the error type requires this.

o9-149: If a TCA requester implements Reliable Connection or Unreliable
Connection service, it shall behave as follows. For locally detected
transport errors that are detected during transmission of request packets, a CA
shall stop transmission for the affected QP, shall store the state
associated with the error until any previous incomplete WQEs are completed,
and finally complete the affected WQE. The QP shall be put into the error
state if the error type requires this.

o9-150: If a CA requester implements Reliable Datagram service, it shall
behave as follows. For locally detected transport errors that are detected
during transmission of request packets, a CA shall stop transmission for
the affected EEC, shall store the state associated with the error until any
previous incomplete WQEs are completed, and finally complete the
affected WQE. The EEC shall be put into the error state if the error type
requires this.

This is required to maintain the ordered completion of WQEs and to
ensure that the error is properly reported in the WQE where the error
occurred.

24 

9.9.2.1.1 Requester Error Retry Counters

C9-210: For an HCA using Reliable Connection service, in order to detect
excessive retries, the requester shall maintain the RNR NAK and Error
retry counters that perform the logical functions described in 9.9.2.1.1
Requester Error Retry Counters on page 338.

o9-151: If a CA requester implements Reliable Datagram service, or if a
TCA requester implements Reliable Connection service, the requester
shall behave as follows. In order to detect excessive retries, the requester
shall maintain the RNR NAK and Error retry counters that perform the
logical functions described in 9.9.2.1.1 Requester Error Retry Counters on
page 338.

Implementations may implement these retry counters in any way they
choose, but for clarity, they are here described as down counters,
initialized to the number of retries allowed before terminating the operation and
creating the final completion error. See 10.2.1.3 Modifying HCA Attributes
on page 371 for the programming of these counters in an HCA.

The RNR NAK retry counter is decremented each time the responder
returns an RNR NAK. If the requester's RNR NAK retry counter
decrements to zero, an RNR NAK retry error occurs. Each time an RNR NAK is
cleared (i.e., an acknowledge message other than an RNR NAK is
returned), the retry counter is reloaded. An exception to the following is if the
RNR NAK retry counter is set to 7. This value indicates infinite retry and
the counter is not decremented.

The Error retry counter is decremented each time the requester must retry
a packet due to a Timeout, NAK-Sequence Error, or Implied NAK. If the
requester's retry counter decrements to zero, one of two things may be
implemented.

If Automatic Path migration is not supported, or has already been
completed, a "Local Packet Timeout Retry Count Exceeded" error shall be
reported in the completion.

o9-152: If a CA supports Automatic Path Migration, then, following a
potentially recoverable error and its retries, the requester may migrate the
connection or EE context and perform the Error retries again before finally
reporting the completion in error. See 17.2.8 Automatic Path Migration on
page 804 for more information.

Each time a packet is properly acknowledged, the retry counter shall be
reloaded.

9.9.2.2 Requester Side Error Detection - Remotely Detected Errors

A remotely detected error occurs when the responder reports an error to
the requester. Remotely detected errors are unique to reliable classes of
service.

Remotely detected errors are reported via a NAK code carried in an
acknowledge message. However, not all NAK codes result in an error being
reported to the requester's client.

Of the possible NAK codes, two (NAK-Sequence Error and NAK-RNR)
indicate operations that should be retried automatically by the requester.

The NAK codes other than NAK-Sequence Error and NAK-RNR indicate
failures that must be reported to the requester's client immediately and
cannot be retried.

9.9.2.3 Summary - Requester Side Error Behavior

Table 56 lists all errors that are detected by the requester, including both
locally and remotely detected errors. If the error is detected locally by the
requester, the column labelled "Syndrome" contains the notation "locally
detected error". If the error is detected remotely by the responder, this
column lists the NAK syndrome that was returned by the responder. The
fault behavior class specifies the actions that the requester takes to report
the error to its client.
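
The RNR NAK retry counter of 9.9.2.1.1 and the retry-or-report split between NAK codes in 9.9.2.2 can be sketched as follows. This is an illustration under stated assumptions: the class, function and constant names are hypothetical, the NAK codes are represented by the strings used in the text, and the counter follows a literal reading of the description (an error occurs when the counter decrements to zero). Real implementations are free to realize the counters differently.

```python
INFINITE_RNR_RETRY = 7  # per the text, a setting of 7 indicates infinite retry


class RnrRetryCounter:
    """Down counter for RNR NAKs, reloaded when the RNR NAK condition clears."""

    def __init__(self, retry_limit):
        self.retry_limit = retry_limit
        self.remaining = retry_limit

    def on_rnr_nak(self):
        """Record an RNR NAK; return True if an RNR NAK retry error occurs."""
        if self.retry_limit == INFINITE_RNR_RETRY:
            return False  # infinite retry: the counter is not decremented
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0

    def on_other_acknowledge(self):
        """Reload the counter when an acknowledge other than RNR NAK arrives."""
        self.remaining = self.retry_limit


# NAK codes that the requester retries automatically; all others are reported.
RETRYABLE_NAKS = {"NAK-Sequence Error", "NAK-RNR"}


def requester_action(nak_code):
    """Return 'retry' for automatically retried NAK codes, else 'report'."""
    return "retry" if nak_code in RETRYABLE_NAKS else "report"
```

With a limit of 2, the second consecutive RNR NAK drives the counter to zero and signals the retry error, whereas any intervening acknowledge other than an RNR NAK reloads the counter.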


Each fault behavior class is specified below in Sections 9.9.2.4.1 through
9.9.2.4.5. For convenience, the six possible classes of fault behaviors are
summarized in Table 57 Summary of Requester Fault Behavior Classes
on page 342 below.


C9-211: For the implemented subset of transport services, requesters
shall conform to the error behavior as specified in Table 56. Also, the
requester's send queue shall behave as specified for each Fault Behavior
Class shown in Table 57 and each of the Requester Class Fault
descriptions.

Table 56 Requester Side Error Behavior

Error | Description | Syndrome | Requester Fault Behavior Class
Packet sequence error. Retry limit not exceeded. | Responder detected a PSN larger than it expected. Requester may retry the request. | NAK-Sequence Error | RC: Class A; RD: Class A; else: NA
Packet sequence error. Retry limit exceeded. | Responder detected a PSN larger than it expected. Requester retried the request "n" times, but it failed each time. | NAK-Sequence Error | RC: Class B; RD: Class D; else: NA
Implied NAK sequence error. Retry limit not exceeded. | Requester detected an ACK with a PSN larger than the expected PSN for an RDMA READ or ATOMIC response. Requester may retry the request. | locally detected error | RC: Class A; RD: Class A; else: NA
Implied NAK sequence error. Retry limit exceeded. | Requester detected an ACK with a PSN larger than the expected PSN for an RDMA READ or ATOMIC response. Requester retried the request "n" times, but it failed each time. | locally detected error | RC: Class B; RD: Class D; else: NA
Packet Timeout error. Retry limit not exceeded. | No ACK response from responder within timer interval. Requester may retry the request. | locally detected error | RC: Class A; RD: Class A; else: NA
Packet Timeout error. Retry limit exceeded. | No ACK response within timer interval. Requester retried the request "n" times, but it timed out each time. | locally detected error | RC: Class B; RD: Class D; else: NA
RNR NAK Retry error. Retry limit not exceeded. | Responder returned RNR NAK. Requester may retry the request. | RNR-NAK | RC: Class A; RD: Class A; else: NA
RNR NAK Retry error. Retry limit exceeded. | Excessive RNR NAKs returned by the responder. Requester retried the request "n" times, but received RNR NAK each time. | locally detected error | RC: Class B; RD: Class B; else: NA
Unsupported OpCode. | Responder detected an unsupported OpCode. | NAK-Invalid Request | RC: Class B; RD: Class B; else: NA
Unexpected OpCode. | Responder detected an error in the sequence of OpCodes such as a missing "Last" packet. Note: there is no PSN error, thus this does not indicate a dropped packet. | NAK-Invalid Request | RC: Class B; else: NA
Local Memory Protection Error. | Requester detected an implementation specific memory protection error in its local memory subsystem. | locally detected error | All: Class B
R_Key Violation | Responder detected an invalid R_Key while executing an RDMA Request | NAK-Remote Access Error | RC: Class B; RD: Class B; else: NA
Remote Operation Error | Responder encountered an error (local to the responder), which prevented it from completing the request. | NAK-Remote Operation Error | RC: Class B; RD: Class B; else: NA

Local Operation Error^ - 
affiliated 


An error occurred in the requester's local 
channel interface that can be associated 
with a certain WQ or EEC. 


locally detected error 


All: Class B 


Local Operation Error^ - 
unaffiliated 


An error occurred in the requester's local 
channel interface that cannot be associated 
with a certain WQ or EEC. 


locally detected error 


All: Class C 


Local RDD Violation 


Requester's EE Context detected an invalid 
RDD on an outbound packet 


locally detected error 


RD: Class B 
else NA 


Remote RDD Violation 


Responder's Receive Queue detected a 
RDD violation 


NAK-lnvalid RD 
Request 


RD: Class B 
else NA 


Remote Q_Key Violation 


Responder's Receive Queue detected a 
Q_Key violation 


NAK-lnvalid RD 
Request 


RD: Class B 
else NA 


Length error 


RDMA READ response message contained 
too much or too little payload data. 


locally detected error 


RC: Class B 
RD: Class B 
else NA 


Bad response 


Unexpected opcode for the response 
packet received at the expected response 
PSN.^ 


locally detected error 


RC: Class B 
RD: Class B 
else NA 


Ghost Acknowledge 


Requester received an acknowledge mes- 
sage at other than the expected response 
PSN. 


locally detected error 


RC: Class E 
RD: Class E 
Else NA 


CQ overflow 


Despite actual execution of the message, 
and acknowledgement, the completion noti- 
fication could not be written to the CQ. 


locally detected error 


All: Class F 



a. Local operations errors tend to be very implementation specific; 

b. For example; RDMA read instead of Acknowledge, NAK code in 

READ Response last" instead of middle. 



not all CA's may have or detect these. 
AETH of an RDMA read, or "RDMA 
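As an illustration of how Table 56 might be consulted in software, the following is a minimal sketch, assuming a plain dictionary keyed by error name. The identifiers (`REQUESTER_FAULT_CLASS`, `requester_fault_class`, the snake_case error names and the `"*"` key standing for "All") are hypothetical and only a handful of rows are reproduced.

```python
# Hypothetical lookup built from a few rows of Table 56.
# Keys and function names are illustrative, not defined by the specification.
REQUESTER_FAULT_CLASS = {
    "packet_sequence_error_retry_not_exceeded": {"RC": "A", "RD": "A"},
    "packet_sequence_error_retry_exceeded": {"RC": "B", "RD": "D"},
    "rnr_nak_retry_exceeded": {"RC": "B", "RD": "B"},
    "unexpected_opcode": {"RC": "B"},
    "local_memory_protection_error": {"*": "B"},  # "All" in Table 56
    "ghost_acknowledge": {"RC": "E", "RD": "E"},
    "cq_overflow": {"*": "F"},
}


def requester_fault_class(error, service):
    """Return the fault class letter, or None where Table 56 says NA."""
    row = REQUESTER_FAULT_CLASS[error]
    # A service-specific entry wins; otherwise fall back to the "All" entry.
    return row.get(service, row.get("*"))
```

For example, `requester_fault_class("unexpected_opcode", "RD")` yields `None` because Table 56 lists that error as NA for every service other than RC.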



Table 57 Summary of Requester Fault Behavior Classes

Fault Behavior Class | Current Send Queue WQE | Subsequent Send Queue WQEs | Final Send Queue State
Requester Class A | no impact | no impact | no change
Requester Class B | completed in error | flushed | error state
Requester Class C | no impact - unaffiliated | no impact | no change (a)
Requester Class D (b) | completed in error | flushed | error state
Requester Class E (c) | no impact | no impact | no change
Requester Class F | unknown | unknown | error state

a. It is possible that this class of error will render the entire HCA unable to
continue work.
b. Classes B and D are similar, however Class D applies to reliable datagram
service only and also specifies that the requester's EE Context transition
to the error state.

c. Classes A and E look similar, but Class A requires a retry. Class E results in
no action.

9.9.2.4 Requester Side Error Response

There are five different sets of error response behaviors that the requester
must implement. Which behavior is executed for any given error is shown
above in Table 56. This section specifies the error response behaviors.

9.9.2.4.1 Requester Class A Fault Behavior

Class A errors are those that are recoverable by the transport through a
retry mechanism. If the retry succeeds, there is no visible impact to the
transport's client (e.g. verbs layer).

The only Class A errors are Packet Sequence Error, Implied NAK sequence
error, Packet Timeout error and an RNR NAK. Packet Sequence
Error and RNR NAK are both remotely detected. A Packet Timeout error
and an Implied NAK sequence error are detected locally by the requester.

C9-212: For an HCA using Reliable Connection service, each time the
transport retries a Requester Class A error, it shall decrement a retry
counter. There is one retry counter associated with Packet Sequence Errors
and Packet Timeout errors, and a different retry counter associated
with RNR NAKs. As long as the retry count has not expired the transport
may continue to retry these errors. The protocol for retrying these errors
is given in Section 9.7 Reliable Service on page 238.

o9-153: If a TCA requester implements Reliable Connection service, or if
a CA requester implements Reliable Datagram service, each time the requester
retries a Requester Class A error, it shall decrement a retry
counter. There is one retry counter associated with Packet Sequence Errors
and Packet Timeout errors, and a different retry counter associated
with RNR NAKs. As long as the retry count has not expired the transport
may continue to retry these errors. The protocol for retrying these errors
is given in Section 9.7 Reliable Service on page 238.

C9-213: For an HCA requester using Reliable Connection service, since
Requester Class A errors are recoverable, the requester shall not report
them to the transport's client unless the retry count expires.

o9-154: If a TCA requester implements Reliable Connection service, or if
a CA requester implements Reliable Datagram service, since Requester
Class A errors are recoverable, the requester shall not report them to the
transport's client unless the retry count expires.

See Section 10.10.2.2 Completion Errors on page 434 for a discussion of
how errors are reported for an HCA once the retry count has expired.



IF (packet sequence error or timeout error)
    THEN decrement Error retry counter

IF (RNR NAK)
    THEN decrement RNR NAK retry counter

IF NOT (packet sequence error or timeout error or RNR NAK)
    THEN reload retry counters

IF (Error retry counter expired)
    IF (RC mode) GOTO Class B
    ELSE GOTO Class D

IF (RNR NAK retry counter expired)
    GOTO Class B

Figure 107 Requester Class A Fault Behavior
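The flow in Figure 107 can be sketched in code. The following is a minimal illustration, not a normative implementation: the names `RetryCounters` and `handle_event`, the event strings and the default limits are all hypothetical, and it assumes that a counter counts as "expired" when an error arrives and no retries remain.

```python
from dataclasses import dataclass, field

# Events that share the Error retry counter (per C9-212 / o9-153).
ERROR_EVENTS = {"packet_sequence_error", "timeout_error"}


@dataclass
class RetryCounters:
    """Hypothetical per-requester retry state; limits are illustrative."""
    error_limit: int = 3
    rnr_limit: int = 3
    error_left: int = field(init=False)
    rnr_left: int = field(init=False)

    def __post_init__(self):
        self.reload()

    def reload(self):
        # "reload retry counters" in Figure 107
        self.error_left = self.error_limit
        self.rnr_left = self.rnr_limit


def handle_event(counters, event, service):
    """Return "retry", "class_b", "class_d" or "ok" for one event."""
    if event in ERROR_EVENTS:
        if counters.error_left == 0:
            # Error retry counter expired: RC goes to Class B, RD to Class D.
            return "class_b" if service == "RC" else "class_d"
        counters.error_left -= 1
        return "retry"
    if event == "rnr_nak":
        if counters.rnr_left == 0:
            # RNR NAK retry counter expired: Class B for both RC and RD.
            return "class_b"
        counters.rnr_left -= 1
        return "retry"
    # Any other outcome (for example a good ACK) reloads both counters.
    counters.reload()
    return "ok"
```

Keeping the two counters separate mirrors the text: sequence and timeout errors share one counter, while RNR NAKs are counted independently.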

9.9.2.4.2 Requester Class B Fault Behavior 

C9-214: In response to a Requester Class B error, the requester shall 
complete the current WQE in error, transition the Send Queue to the error 
state and mark any subsequent WQEs posted to the Send Queue as 
flushed. 

For an HCA, the error is posted as "Completion - Processing type" with
the appropriate error type (see 10.10.2.2 Completion Errors on page 434
for more details).

The queue shall be transitioned to the error state by the transport layer to
prevent a race condition that can occur if the client (e.g. the verbs layer
for an HCA) posts further WQEs to the Send Queue before it discovers
that an error has occurred. This is consistent with the Send Queue state
diagram as shown in Figure 115 QP/EE Context State Diagram on page
385.

Finally, all WQEs in the Send Queue behind the failed WQE are also com- 
pleted with the "Completed - Flushed in Error" status. 

For RC mode, note that some of these requests may have been committed
to the wire by the requester, and may even have been executed
and completed by the responder. It is not possible to prevent this since the
responder may have executed the request before the requester detects a
local error. Therefore, the responder's local state must be considered
unknown.

Currently active WQE is completed in error

Send Queue is transitioned to the Error State

Subsequent WQEs (those behind the failed WQE in
the queue) are completed with the "Completed -
Flushed in Error" status



Figure 108 Requester Class B Fault Behavior

For reliable datagram service, the requester's EE Context terminates the
current message transfer, signals the error to the currently scheduled
Send Queue, and removes the currently scheduled Send Queue from the
scheduler. The EE Context then schedules the next Send Queue requesting
service and proceeds. The Send Queue which caused the error
behaves as described above.

C9-215: While the Send Queue is in the error state, it must silently discard
any acknowledge messages that arrive.
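The Class B sequence (complete the current WQE in error, move the Send Queue to the error state, flush everything behind it, and discard later acknowledges) might be modeled as below. This is a sketch under stated assumptions: `Wqe`, `SendQueue`, the status strings and the method names are hypothetical and do not correspond to any verbs API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Wqe:
    wr_id: int
    status: str = "pending"


@dataclass
class SendQueue:
    """Hypothetical Send Queue model used only to illustrate Class B."""
    state: str = "ready"
    wqes: deque = field(default_factory=deque)
    completions: list = field(default_factory=list)

    def post(self, wqe):
        if self.state == "error":
            # WQEs posted after the transition are completed as flushed.
            wqe.status = "flushed_in_error"
            self.completions.append(wqe)
        else:
            self.wqes.append(wqe)

    def class_b_error(self, error_status):
        # 1. The currently active WQE is completed in error.
        if self.wqes:
            current = self.wqes.popleft()
            current.status = error_status
            self.completions.append(current)
        # 2. The Send Queue transitions to the error state.
        self.state = "error"
        # 3. WQEs behind the failed WQE are completed as flushed.
        while self.wqes:
            wqe = self.wqes.popleft()
            wqe.status = "flushed_in_error"
            self.completions.append(wqe)

    def accepts_ack(self):
        # C9-215: acknowledges are silently discarded in the error state.
        return self.state != "error"
```

Performing the state transition inside the transport model, rather than leaving it to the client, reflects the race condition the text describes: a client may keep posting WQEs before it learns of the error.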

9.9.2.4.3 Requester Class C Fault Behavior

Since a Class C error cannot be associated with any particular WQE, it is
not possible to mark a specific WQE as completed in error. For an HCA,
the error posted is, "Asynchronous - Affiliated Error".

C9-216: If the Requester Class C error can be associated with a QP, the
Send Queue shall be transitioned to the error state, and all uncompleted
WQEs are completed with the "Completed - Flushed in Error" status.

o9-155: If a CA requester implements Reliable Datagram service, and if
the Requester Class C error can be associated with an EE Context, its
send side shall be transitioned to the error state, and for an HCA, the error
posted is, "Asynchronous - Affiliated Error". See 10.10.2.2 Completion Errors
on page 434 for more details on HCA error reporting.

9.9.2.4.4 Requester Class D Fault Behavior

A Class D error only occurs for reliable datagram service.

o9-156: If a CA requester implements Reliable Datagram service, it shall
behave as follows. For the Requester Class D error class, the transport
shall transition the requester's EE Context to the error state, terminate the
current message transfer, signal the error to the currently scheduled Send
Queue, and de-queue the currently scheduled Send Queue. While remaining
in error state, the EE Context continues to transition to error state
any other Send Queue requesting service.

Each Send Queue (QP) behaves as though for a Class B error; it marks
its current WQE as completed in error, transitions the QP to the error
state, and flushes all subsequent WQEs.

Currently active WQE is completed in error

EEC Send Side is transitioned to the Error State

All Send Queues currently and subsequently linked to
the EEC Send side are transitioned to error state

Subsequent WQEs (those behind the failed WQE in
each Send Queue in error state) are completed with
the "Completed - Flushed in Error" status



Figure 109 Requester Class D Fault Behavior

9.9.2.4.5 Requester Class E Fault Behavior

A Class E error occurs when the requester receives an acknowledge message
with a PSN which does not match its expected PSN. These errors
occur only for reliable classes of service.

These errors are not reported to upper layers.

An acknowledge message with an unexpected PSN is presumed to represent
a "ghost" acknowledge message, or a duplicate acknowledge message.

A ghost acknowledge message is an acknowledge message that has
been in the fabric long enough that it has survived the destruction of a connection
and the subsequent establishment of a new connection.

A duplicate acknowledge message occurs when the requester, believing
that its original request message is lost in the fabric, re-sends the request
message. If both request messages eventually arrive at the responder,
the responder may generate an acknowledge message for each of them.

C9-217: For an HCA requester using Reliable Connection service, in response
to a Requester Class E error, the requester shall drop the acknowledge
message. There is, however, an exception to this rule. For
reliable connected service, a duplicate acknowledge message may be
used by the responder to carry end-to-end flow control credits to the requester
(an "unsolicited acknowledge"). Thus, if the PSN of the acknowledge
message is one less than the requester's expected PSN, the
requester must recover the end-to-end credits and discard the remainder
of the message. This behavior is detailed in section 9.7.7.2 End-to-End
(Message Level) Flow Control on page 296.

o9-157: If a TCA requester implements Reliable Connection service, or if
a CA requester implements Reliable Datagram service, in response to a
Requester Class E error, the requester shall drop the acknowledge message.

It should be noted that even if the Acknowledgment was an actual ghost,
with wrong credits, the credit mechanism would eventually recover with no
errors reported to the upper layers.
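The Class E rule and its Reliable Connection exception can be expressed as a small decision function. The sketch below is illustrative only: `handle_ack` and its return strings are hypothetical, the requester is simplified to a single expected PSN, and PSN arithmetic is assumed to wrap at 24 bits.

```python
PSN_MASK = 0xFFFFFF  # assumption: PSNs are 24-bit values that wrap around


def handle_ack(ack_psn, expected_psn, service):
    """Decide what a requester does with an inbound acknowledge message."""
    if ack_psn == expected_psn:
        return "process"
    one_less = (expected_psn - 1) & PSN_MASK
    if service == "RC" and ack_psn == one_less:
        # Unsolicited acknowledge: take the end-to-end credits,
        # then discard the remainder of the message.
        return "recover_credits_then_discard"
    # Ghost or duplicate acknowledge: drop silently, nothing is reported.
    return "drop"
```

Masking the subtraction keeps the "one less than expected" comparison correct when the expected PSN is zero.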



9.9.2.4.6 Requester Class F Fault Behavior

C9-218: A Requester Class F error occurs when the CQ is inaccessible
or full and an attempt is made to complete a WQE. The affected QP shall
be moved to the error state and an affiliated asynchronous error generated.
The current WQE and any subsequent WQEs are left in an unknown
state. See 10.10.2.3 Asynchronous Errors on page 435.

9.9.3 Responder Side Behavior (1)

Table 58 lists the errors that must be detected by the responder, and the
Fault Behavior Class for each error. The Fault Behavior Class specifies
whether the responder returns a NAK code, whether the error is reported
to the local client, and the subsequent behavior of the Receive Queue.
The syndrome column lists the NAK code that is returned to the requester.

For convenience, a summary of the fault behavior classes is shown in
Table 59 Summary of Responder Fault Class Behaviors on page 351.

The error detection for reliable service is described in section 9.7 Reliable
Service on page 238, and error detection for unreliable service is specified
in section 9.8 Unreliable Service on page 315.

C9-219: For the implemented subset of transport services, responders
shall conform to the error behavior as specified in Table 58. Also, the
responder's Receive queue shall behave as specified for each Fault Behavior
Class shown in Table 59 and each of the sub-sections below.



Table 58 Responder Error Behavior Summary

Error | Description | Service | Syndrome | Fault Behavior Class
Malformed WQE | Responder detected a malformed Receive Queue WQE while processing the packet. | RC, RD | NAK-Remote Operational Error | Responder Class A
 | | Else | NA | Responder Class A
Unsupported or Reserved Opcode | Inbound request OpCode was either reserved, or was for a function not supported by this QP, e.g. RDMA or ATOMIC on QP not set up for this. For RC this is "QP Async affiliated" | RC | NAK-Invalid Request | Responder Class C
 | | RD | NAK-Invalid Request | Responder Class [illegible]
 | | else | NA | Responder Class D
Misaligned ATOMIC | VA does not point to an aligned address on an atomic operation | RC | NAK-Invalid Request | Responder Class C
 | | RD | NAK-Invalid Request | Responder Class B
Too many RDMA READ or ATOMIC Requests | There were more requests received and not ACKed than allowed for the connection | RC | NAK-Invalid Request | Responder Class C
 | | RD | NAK-Invalid Request | Responder Class B
Out of Sequence Request Packet | PSN of the inbound request is outside the responder's valid PSN window. | RC, RD | NAK-Sequence error | Responder Class B
 | | UC | NA | Responder Class D
Out of Sequence OpCode, current packet is "first" or "only" | The Responder detected an error in the sequence of OpCodes; a missing "Last" packet | RC | NAK-Invalid Request | Responder Class C
 | | RD | NA | Responder Class E
 | | UC | NA | Responder Class D1
Out of Sequence OpCode, current packet is not "first" or "only" | The Responder detected an error in the sequence of OpCodes; a missing "First" packet | RC | NAK-Invalid Request | Responder Class C
 | | RD | NAK-Invalid Request | Responder Class B
 | | UC | NA | Responder Class D
R_Key Violation | Responder detected an R_Key violation while executing an RDMA request. | RC | NAK-Remote Access Violation | Responder Class C
 | | RD | NAK-Remote Access Violation | Responder Class B
 | | UC | NA | Responder Class D
Local QP Error | Responder detected a local QP related error while executing the request message. The local error prevented the responder from completing the request. | RC, RD | NAK-Remote Operational Error | Responder Class A
 | | Else | NA | Responder Class A
Q_Key Violation | Responder's Receive Queue detected an invalid Q_Key in the request message (b) | RD | NAK-Invalid RD Request | Responder Class B
 | | UD | NA | Responder Class D
Packet Header Violation | Responder detected a header violation that requires a silent drop as described in 9.6 Packet Transport Header Validation on page 228 | RC, RD, UC, UD | none | Responder Class D
RDD Violation | Responder's Receive Queue detected an invalid RDD | RD | NAK-Invalid RD Request | Responder Class [illegible]
Invalid Dest QP | Dest QP does not exist or is not configured for RD service | RD | NAK-Invalid RD Request | Responder Class B
Resources Not Ready Error | A WQE or other resource is not currently available. | RC, RD | RNR-NAK | Responder Class B
Length errors | 1) Inbound "Send" request message exceeded the responder's available buffer space: "Local Length Error". 2) RDMA WRITE request message contained too much or too little payload data compared to the DMA length advertised in the first or only packet. 3) Payload length was not consistent with the opcode: a: 0 byte <= "only" <= PMTU bytes; b: ("first" or "middle") == PMTU bytes; c: 1 byte <= "last" <= PMTU bytes | RC | NAK-Invalid Request | Responder Class C
 | | RD | NAK-Invalid Request | Responder Class F
 | | UC, UD (only 1, 3a) | NA | Responder Class D
Invalid duplicate ATOMIC Request | A duplicate ATOMIC request packet is received, but the PSN does not match the PSN of a saved ATOMIC Request. | RC, RD | none | Responder Class D
CQ overflow | Despite actual execution of the message, and acknowledgement, the completion notification could not be written to the CQ. | All | none | Responder Class G

a. The responder may return either Class B or C, depending on its ability to reuse the affected WQE (if any).
b. Q_Key violations require the incrementing of a counter and a potential trap as described in 10.2.4 Q Keys on page 376.
1. For Unreliable services, a better title for Section 9.9.3 might be Receiver side Behavior.
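Item 3 of the "Length errors" row in Table 58 states a per-opcode constraint on payload length. A minimal sketch of that check follows; the function name `payload_length_ok` and the string values used for the packet position are hypothetical.

```python
def payload_length_ok(position, payload_len, pmtu):
    """Check a payload length against the opcode's position in a message.

    position is one of "only", "first", "middle", "last" (illustrative names).
    """
    if position == "only":
        return 0 <= payload_len <= pmtu      # 3a
    if position in ("first", "middle"):
        return payload_len == pmtu           # 3b
    if position == "last":
        return 1 <= payload_len <= pmtu      # 3c
    raise ValueError(f"unknown packet position: {position!r}")
```

A zero-length payload is therefore acceptable only in an "only" packet, and every "first" or "middle" packet must carry exactly one PMTU of data.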
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Table 59 Summary of Responder Fault Class Behaviors 





Fault Behavior Class | Syndrome | Current WQE | Subsequent WQEs | Final Queue State
Responder Class A | For reliable services: Remote Operational Error | WQE completed in error | flushed | error state
Responder Class B | Invalid Request; Remote Access Violation; Sequence error; RNR-NAK | no WQE consumed | no impact | no change
Responder Class C | Invalid Request | completed in error | flushed | error state
Responder Class D, D1 | none | no WQE consumed | no impact | no change
Responder Class E | none | completed in error | no impact | no change
Responder Class F | Invalid Request | completed in error | no impact | no change
Responder Class G | none | unknown | unknown | error state

a. A WQE is only completed if open for Sends and RDMA WRITE with Immediate data.

9.9.3.1 Responder Side Error Response 

There are a total of seven classes of fault behavior described for the re- 
sponder side. The fault behaviors are grouped according to whether or not 
an error is reported to the client on the responder side, whether or not the 
error is reported to the requester via a NAK code, and whether or not a 
WQE is consumed from the Receive Queue. 
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9.9.3.1.1 Responder Class A Fault Behavior 

Class A errors are traceable to a poorly formed or invalid WQE, or other
error associated with the receiver QP. These errors are not caused by the
sender.

C9-220: For a Responder Class A error, the error shall be reported to the 
responder's client, the QP is placed into the error state, and, for reliable 
services, a NAK-Operation Error is generated. 

For Reliable Datagram service, the EEC continues operation. 

If the responder is an HCA, these errors are reported to the verbs layer as 
a "Completion error - interface type", "Internal Consistency error". See 
10.10.2 Error Handling Mechanisms on page 434 for a discussion of Com- 
pletion errors. 

If the responder detects a Class A error, its behavior is as follows: 



For reliable services, NAK-Operation Error returned to 
the requester 



Currently active receive WQE is Completed in Error 



Send and Receive Queues are transitioned to the
Error State



All other WQEs on both queues, and all WQEs subse- 
quently posted to either Queue, are completed with 
the "Completed - Flushed in Error" status 



Figure 110 Transport Class A Responder Behavior

9.9.3.1.2 Responder Class B Fault Behavior

Class B errors are reported to the requester, but are not reported to the
responder's local client.

C9-221: For an HCA requester using Reliable Connection service, and for
a Responder Class B responder side error, the transport shall generate a
NAK code, but shall not consume a WQE from the Receive Queue or
transit the receive queue to the error state.

o9-158: If a TCA responder implements Reliable Connection service, or if
a CA responder implements Reliable Datagram service, it shall behave as
follows. For a Responder Class B responder side error, the transport shall
generate a NAK code, but shall not consume a WQE from the Receive
Queue or transit the receive queue to the error state.

Note that this fault behavior class is limited to reliable services only.

If the responder detects a Class B error, it behaves as follows:

Appropriate NAK code is returned to the requester

Resume waiting for a valid inbound request packet

Figure 111 Responder Class B Fault Behavior
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9.9.3.1.3 Responder Class C Fault Behavior 

C9-222: For an HCA responder using Reliable Connection service, for a
Class C responder side error, the error shall be reported to the responder's
client and the QP is placed into the error state. A Class C error
shall also be reported to the requester by generating a NAK-Invalid Request.
In the case of an HCA, the error reported in a Receive queue completion
is a "Completion - Process Error". If the current operation does
not use a receive WQE, then an affiliated asynchronous error is generated.

o9-159: If a TCA responder implements Reliable Connection service, for
a Class C responder side error, the error shall be reported to the responder's
client and the QP is placed into the error state. A Class C error
shall also be reported to the requester by generating a NAK-Invalid Request.

The Receive Queue's behavior is as follows:

Current WQE (if any) is completed in error. An appropriate
error code is returned to the upper layer

Appropriate NAK code is generated

Send and Receive Queues are transitioned to the
error state. New inbound requests are dropped.

All other WQEs on both queues, and all WQEs subsequently
posted to either Queue, are completed with
the "Completed - Flushed in Error" status

Figure 112 Transport Class C Receive Queue Behavior

See Section 10.10.2.2 Completion Errors on page 434 for more details on
HCA error reporting.

9.9.3.1.4 Responder Class D Fault Behavior

C9-223: An inbound request packet which causes a Responder Class D
error shall cause the Transport to respond as specified in 9.9.3.1.4: Responder
Class D Fault Behavior.

In this case the transport shall:

• silently drop the packet
• Not generate an ACK or NAK code to the requester
• Not notify the responder's client
• terminate the current message without consuming the current receive
WQE (if any)
• wait for the first packet of a new message (For reliable services, the
new message must begin at the expected PSN.)

The Current WQE (if any) is reset to accept the next incoming Send or
RDMA WRITE with Immediate message.

"Current message" means all the packets received since the most recently
received "first" or "only" OpCode, including the present packet
(which caused the Class D error).

A "new message" is denoted by a packet with a BTH opcode of "first" or
"only".

9.9.3.1.5 Responder Class D1 Fault Behavior

An inbound request packet which causes a Class D1 error only occurs in
Unreliable Connection mode.

C9-224: For an HCA responder using Unreliable Connection service, an
inbound request packet which causes a Responder Class D1 error shall
cause the Transport to respond as specified in 9.9.3.1.5: Responder
Class D1 Fault Behavior.

In this case the transport shall:

• silently drop the packet
• Not notify the responder's client
• terminate the current message without consuming the current receive
WQE (if any)
• wait for the first packet of a new message (which may be greater than
the expected PSN.)

If the present packet (which caused the Class D1 error) has a BTH opcode
of "first" or "only", it shall be treated as the first packet of a new message.

The Current WQE (if any) shall be reset to accept the next incoming Send
or RDMA WRITE with Immediate message.

"Current message" means all the packets received since the most recently
received "first" or "only" OpCode, excluding the present packet
(which caused the Class D1 error).

A "new message" is denoted by a packet with a BTH opcode of "first" or
"only".

o9-160: If a TCA responder implements Unreliable Connection service, it
shall conform to the Class D1 HCA responder behavior described in the
preceding compliance statement.
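The practical difference between Class D and Class D1 is whether the offending packet itself can begin the next message. The sketch below illustrates only that distinction; `next_step_after_silent_drop` and its return strings are hypothetical names, and the common actions (no ACK or NAK, no client notification, current WQE reset but not consumed) are assumed to have been performed already.

```python
def next_step_after_silent_drop(fault_class, position):
    """Return what the responder waits for after a Class D or D1 error.

    position is the BTH opcode position of the offending packet:
    "first", "only", "middle" or "last" (illustrative names).
    """
    if fault_class not in ("D", "D1"):
        raise ValueError("only Class D and Class D1 are modeled here")
    if fault_class == "D1" and position in ("first", "only"):
        # Class D1: the present packet is excluded from the terminated
        # message and is treated as the first packet of a new message.
        return "start_new_message_with_this_packet"
    # Class D, or Class D1 with a "middle"/"last" packet: the present
    # packet is dropped and the responder waits for a new message.
    return "wait_for_first_or_only_packet"
```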
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9.9.3.1.6 Responder Class E Fault Behavior 

This fault class is intended primarily for services where a failure of a par- 
ticular request packet should not impact the ability of the Receive Queue 
to continue receiving messages. 

o9-161: If a CA implements Reliable Datagram service, then a Responder
Class E error shall cause the responder to mark the current WQE (if any)
as completed in error "Remote Work Request Error", and receive the new
inbound message. The Receive Queue shall continue operation without a
transition to the error state.

Current WQE (if any) is completed in error. An appropriate
error code is written to the WQE

Resume waiting for a valid inbound request packet

Figure 113 Transport Class E Receive Queue Behavior

9.9.3.1.7 Responder Class F Fault Behavior 

A Class F error only occurs due to an invalid request length in Reliable
Datagram service.

o9-162: If a CA implements Reliable Datagram service, then a Responder
Class F error shall be reported both to the requester and (when a receive
WQE is involved) to the responder's client. The Transport shall return the
"NAK-Invalid request" to the requester, and complete the current WQE (if
any) in error. The Receive Queue shall continue operation without a transition
to the error state.

In the case of an HCA, the error reported is a "Completion - Process
Error".

Both the EEC and destination QP remain in operation.

The Receive Queue's behavior is as follows:

Current WQE (if any) is completed in error. An appropriate
error code is written to the WQE

Appropriate NAK code is generated

Resume waiting for a valid inbound request packet



Figure 114 Transport Class F Receive Queue Behavior

9.9.3.1.8 Responder Class G Fault Behavior

A Class G error occurs when the CQ is inaccessible or full and an attempt
is made to complete a WQE.

C9-225: A Responder Class G error occurs when the CQ is inaccessible
or full and an attempt is made to complete a WQE. The affected QP shall
be moved to the error state and an affiliated asynchronous error generated.
The current WQE and any subsequent WQEs are left in an unknown
state.

See 10.10.2.3 Asynchronous Errors on page 435.
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9.10 Header and Data Field Source 

9.10.1 Field source when generating packets 

The following tables provide an indication of the source of the various 
header and data fields in the data packets for the various IBA services. 
The following terms are used in the table: 

Link This indicates the value is attached to the packet based on either a fixed 
value, or values dependent on the service, or values looked up based on 
parameters loaded into the logical port. Done by the link layer. 

Tr This indicates that the value is fixed or calculated by the transport layer. 

QP This indicates that the value is derived from the QP context 

EE This indicates that the value is derived from the EE context 

QP+EE This indicates that the values are derived from the QP and EE contexts 

NA Not Applicable 

WQE The value is directly or indirectly (via Address vector) derived from infor- 
mation in the WQE 

Table 60 Packet Fields and Parameters by Service

Parameter | Description | RC | UC | RD | UD | Raw IP | Raw ET
LRH VL | The VL to use for requests. Based on SL and the port SL to VL mapping table. | link | link | link | link | link | link
LRH LVer | The version of the link level. This field depends on the revision of the device. | link | link | link | link | link | link
LRH SL | The SL to use for requests | QP | QP | EE | WQE | WQE | WQE
LRH LNH IBA | IBA transport bit, indicates that BTH follows | 1 | 1 | 1 | 1 | 0 | 0
LRH LNH GRH | GRH bit, indicates that a GRH follows | QP | QP | EE | WQE | 1 | 0
LRH DLID | Destination local ID used for routing | QP | QP | EE | WQE | WQE | WQE
LRH Packet Length | Length of the local packet; calculated by the transport based on the message length. | WQE | WQE | WQE | WQE | WQE | WQE
LRH SLID (high bits not covered by LMC) | Source local ID in outgoing packets. From the port. With LMC low order bits (0s) added, the value is called "Base LID". | link | link | link | link | link | link
LRH SLID (low bits covered by the LMC) | Source logical ID in outgoing packets. These LMC (as a number) bits are called the "path" bits. | QP | QP | EE | WQE | WQE | WQE
GRH IPVer | CA's set to 6 | Tr | Tr | Tr | Tr | Tr | NA
GRH Tclass | CA's set to 0; it will then be loaded with another value at the first encountered router. Alternately set according to application. | QP | QP | EE | WQE | WQE | NA
GRH FlowLabel | CA's set to 0; it will then be loaded with another value at the first encountered router. Alternately set according to application. | QP | QP | EE | WQE | WQE | NA



1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

21 

22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

32 

33 

34 

35 

36 

37 

38 

39 

40 

41 

42 



InfiniBand^'^ Trade Association 



Page 359 

Exhibit A, Amendment Under Rule 1 16 filed Dec. 21, 2007, 09/905,067 



InfiniBand™ Architecture Release 1.0 
Volume 1 - General Specifications 



Transport Layer 



October 24, 2000 
FINAL 



Table 60 Packet Fields and Parameters by Service 















^RfaVv^ 




GRH Paylen 


Length of the global packet; calculated by the trans- 
port based on the message length. 


WQE 


WQE 


WQE 


WQE 


WQE 


NA 


GRH NxtHdr 


CA's set to IBA (value [awaiting IETF assignment]) 


Tr 


Tr 


Tr 


Tr 


WQE 


NA 


GRH HopLmt 


CA's set to 0; it will then be loaded with another value 
at the first encountered router. 
Alternately set according to application. 


QP 


QP 


EE 


WQE 


WQE 


NA 


GRH SGID 


Source Global ID, from the port table and the index 
found in: 


QP 


QP 


EE 


WQE 


WQE 


NA 


GRH DGID 


Destination Global ID 


QP 


QP 


EE 


WQE 


WQE 


NA 


BTH Opcode 


Depends on operation, set by the transport layer. 


Tr 


Tr 


Tr 


Tr 


NA 


NA 


BTH TVer 


The version of the transport. This field depends on the 
revision of the device (0). 


Tr 


Tr 


Tr 


Tr 


NA 


NA 


BTH P_Key 


Partition Key from the port table and the Index found 
in: 


QP 


QP 


EE 


QP 


NA 


NA 


BTH DestQP 


Destination QP 


QP 


QP 


WQE 


WQE 


NA 


NA 


BTH Pad 


Length of packet pad; used to calculate actual data 
size. Calculated by the transport layer based on data 
size. 


WQE 


WQE 


WQE 


WQE 


NA 


NA 


BTH SE 


Solicited Event 


WQE 


WQE 


WQE 


WQE 


NA 


NA 


BTH M 


Migrate. Set by the transport dependent on the migra- 
tion state. 


Tr 


Tr 


Tr 


Tr 


NA 


NA 


BTH AckReq 


Acknowledge request 


Tr 


0 


Tr 


0 


NA 


NA 


BTH PSN 


Packet Sequence Number 


QP 


QP 


EE 


QP 


NA 


NA 


RDETH EEC 


Destination EE Context 


NA 


NA 


EE 


NA 


NA 


NA 


DETH Q_Key 


Key which protects datagram QPs 


NA 


NA 


WQE 


WQE 


NA 


NA 


DETH Source QP 


Source QP. Set by transport for datagram services. 


NA 


NA 


Tr 


Tr 


NA 


NA 


RETH 


All fields of the RDMA Extended Transport Header 
(when used) are taken from the WQE 


WQE 


WQE 


WQE 


NA 


NA 


NA 
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Table 60 Packet Fields and Parameters by Service 















OP 


1^ 


AtomicETH 


All fields of the ATOMIC Extended Transport Header 
(when used) are taken from the WOE 


WUc 


kl A 

NA 


WQE 


NA 


NA 


NA 


AETH MSN 


Message Sequence number (ACKs only) 


OP 


NA 


0 


NA 


NA 


NA 


AFTH Svndromf* 

III wyiiurwiiiw 


Acknowledge syndrome, computed based on opera- 
tion for reliable services 


QP 


NA 


EE+ 
QP 


NA 


NA 


NA 


AETH RNR-NAK timer 
(TTTTT) 


This value is placed in the AETH.TTTTT field when 
sending an RNR NAK. It denotes the minimum time to 
wait before retrying the request. 


QP 


NA 


QP 


NA 


NA 


NA 


AETH credit count 
(CCCCC) 


This value is placed in the AETH.CCCCC Tietd when 
sending an Ack in RC mode. It denotes the number of 
receive WQEs available to receive Send or RDMA 
write with immediate messages. 


QP 


NA 


Kl A 

NA 


NA 


Kl A 

inA 


MA 

NA 


AtomicAckETH 


ATOMIC data returned; the data is loaded as defined 
by the R_Key and Virtual Address, stored per WOE 


WQE 


NA 


WQE 


NA 


NA 


NA 


Immediate data 


Dependent on operation 


WOE 


WQE 


WQE 


WQE 


NA 


NA 


Payload 


Dependent on operation 


WQE 


WQE 


WQE 


WQE 


WQE 


WQE 


ICRC 


Calculated by transport; data dependent 


Tr 


Tr 


Tr 


Tr 


NA 


NA 


VCRC 


Calculated by Link layer; data dependent 


link 


link 


link 


link 


link 


link 
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9.10.2 Transport Connection Parameters 



The following are not sent "on the wire" but are needed to implement the 
protocol. This table is included to provide a better understanding of the pa- 
rameters used by the transport layer to provide a connection. This list only 
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covers elements mentioned in the IBA specification, other elements may 
be needed to completely implement connections. 

Table 61 Connection Parameters by Transport Service

Parameter | Description | RC | UC | RD | UD | Raw IP | Raw ET
Connect state | State of connection (Reset, RTR, RTS, Error etc.) | QP | QP | EE+QP | QP | QP | QP
Port number | Used only if there is more than a single port | QP | QP | EE | QP | QP | QP
Global/Local header | Determines if global header is to be attached or not. | QP | QP | EE | WQE | NA | NA
MTU | Size of the packets allowed on this connection. | QP | QP | EE | WQE | WQE | WQE
RNR NAK retry time | Time sent to requestor when signaling RNR | Tr | NA | Tr | NA | NA | NA
RNR Retries | Send Queue retry count for RNR | QP | NA | EE | NA | NA | NA
ACK_Timeout | The maximum delay before declaring an ACK as "lost" | QP | NA | EE | NA | NA | NA
Error Retries | Send Queue retry count for sequence or time-out errors | QP | NA | EE | NA | NA | NA
MigState | Migration State (Migrated, Armed, ReArm) | QP | QP | EE | QP | NA | NA
Disable_E2E_Credits | Send queue use E2E protocol (depends on remote side's ability to send credits) | QP | NA | NA | NA | NA | NA
Path Speed (IPD) | Controls packet emission for slower links | QP | QP | EE | WQE | WQE | WQE
PD | Protection domain for this QP | [illegible] | QP | QP | QP | QP | QP
[illegible] | Reliable datagram domain | NA | NA | EE | NA | NA | NA
XmitPSN | Sequence number used when sending | QP | QP | EE | QP | NA | NA
AckPSN | Sequence number expected for the ACKs | QP | NA | EE | NA | NA | NA
Rx ePSN | Sequence number expected when receiving | QP | QP | EE | NA | NA | NA
RxAckPSN | Number of unacknowledged Rx packets | QP | NA | EE | NA | NA | NA
SSN | Transmit messages Sent Sequence Number | QP | NA | NA | NA | NA | NA
Rx MSN | Message Sequence Number | QP | NA | NA | NA | NA | NA
Rx credits | Rx queue elements posted | QP | NA | NA | NA | NA | NA
LSN | Limit Sequence number (credit accounting) | QP | NA | NA | NA | NA | NA
SchQP_dequeue | QP at head of schedule queue (RD mode) | NA | NA | EE | NA | NA | NA
SchQP_enqueue | QP at tail of schedule queue (RD mode) | NA | NA | EE | NA | NA | NA
SchQP_Next | Pointer to next QP to be scheduled (RD mode) | NA | NA | QP | NA | NA | NA
Num_RDMA_Reads | Number of RDMA READs or ATOMICs supported by remote side | QP | NA | 1 | NA | NA | NA
RDMARA/A/MH/Sz or ATOMIC "result" | The "hidden" stored address(s) of RDMA READ request(s) or ATOMIC results | QP | NA | EE | NA | NA | NA
RDMA PSN # or ATOMIC PSN # | Sequence number of requested op, used to match response on a repeat, and store reply PSN | QP | NA | EE | NA | NA | NA
RDMAR/ATOMIC Use | Usage of the resource; 1=RDMAR, 0=ATOMIC | QP | NA | EE | NA | NA | NA
Rx Completion Q | | QP | QP | QP | QP | QP | QP
Tx Completion Q | | QP | QP | QP | QP | QP | QP
Tx WQE pointer | Points to current Send WQE and its data segments for requests | QP | QP | QP | QP | QP | QP
Tx ACK WQE pointer | Points to current Send WQE and its data segments for Completions | QP | QP | QP | QP | QP | QP
Rx WQE pointers | Points to current Receive descriptor | QP | QP | QP | QP | QP | QP

9.10.3 Packet Header and Data Field Validation

The following tables provide an indication of the validation responsibility
of the various header and data fields in the data packets for the various
IBA services. The following terms are used in the table:

Link This indicates the value is checked by the link layer.

Tr This indicates that the value is checked against fixed values or used by
the transport layer to select among choices.

QP This indicates that the value is checked against values from the QP
context

EE This indicates that the value is checked against values from the EE
context

NC The value is Not checked

NA Not Applicable

WQE The value is checked against information derived from the WQE

Table 62 Packet Fields Validation source by Service

Parameter | Description | RC | UC | RD | UD | Raw IP | Raw ET
LRH VL | The VL on incoming packet. | link | link | link | link | link | link
LRH LVer | The version of the link level. This field depends on the revision of the device. | link | link | link | link | link | link
LRH SL | The SL to use for requests | NC | NC | NC | NC | NC | NC
LRH LNH IBA | IBA transport bit, indicates that BTH follows | Tr(1) | Tr(1) | Tr(1) | Tr(1) | Tr(0) | Tr(0)
LRH LNH GRH | GRH bit, indicates that a GRH follows | QP | QP | EE | Tr | Tr(1) | Tr(0)
LRH DLID | Destination local ID used for routing. This is always checked at the link layer against Base LID and LMC. | link, QP | link, QP | link, EE | link | link | link
LRH Packet Length | Length of the local packet; checked against PMTU at link, valid packet size at Transport, and data buffer size and protection values. | WQE | WQE | WQE | WQE | WQE | WQE
LRH SLID | Source local ID in ongoing packets. | QP | QP | EE | NC[a] | NC[a] | NC[a]
GRH IPVer | Checked for the value '6' | Tr | Tr | Tr | Tr[a] | Tr[a] | NA
GRH Tclass | Traffic Class | NC | NC | NC | NC[a] | NC[a] | NA
GRH FlowLabel | Flow label | NC | NC | NC | NC[a] | NC[a] | NA
GRH Paylen | Length of the global packet; may be checked against PMTU and LRH Packet Length at link, valid packet size at Transport, and data buffer size and protection values. | WQE | WQE | WQE | WQE[a] | WQE[a] | NA
GRH NxtHdr | Checked for the value 'IBA' (value [awaiting IETF]) | Tr | Tr | Tr | Tr[a] | NC[a] | NA
GRH HopLmt | Hop Limit | NC | NC | NC | NC[a] | NC[a] | NA
GRH SGID | Source Global ID | QP | QP | EE | NC[a] | NC[a] | NA
GRH DGID | Destination Global ID | QP | QP | EE | NC[a] | NC[a] | NA
BTH Opcode | Depends on operation | Tr | Tr | Tr | Tr | NA | NA
BTH TVer | The version of the transport. | Tr | Tr | Tr | Tr | NA | NA
BTH P_Key | Partition Key; checked against the port partition table and an index in the:[b] | QP | QP | EE | QP | NA | NA
BTH DestQP | Destination QP; checked against the valid set and QP mode by transport. | Tr | Tr | Tr | Tr | NA | NA
BTH Pad | Length of packet pad; supplements LRH Packet Length. | WQE | WQE | WQE | WQE | NA | NA
BTH SE | Solicited Event; passed to upper layers for each message | Tr | Tr | Tr | Tr | NA | NA
BTH M | Migrate. Checked and used by transport to select alternate path parameters | Tr | Tr | Tr | NC | NA | NA
BTH AckReq | Acknowledge request | Tr | NC | Tr | NC | NA | NA
BTH PSN | Packet Sequence Number | QP | QP | EE | NC | NA | NA
RDETH EEC | Destination EE Context; checked against the valid set and EE mode by transport. | NA | NA | Tr | NA | NA | NA
DETH Q_Key | Key which protects datagram QPs | NA | NA | QP | QP | NA | NA
DETH Source QP | Source QP. Passed to upper layers for each message. | NA | NA | NC | NC | NA | NA
RETH | All fields of the RDMA Extended Transport Header (when used) are validated against protection parameters associated with QP state. | QP | QP | QP | NA | NA | NA
AtomicETH | All fields of the ATOMIC Extended Transport Header (when used) are validated against protection parameters associated with QP state. | QP | NA | QP | NA | NA | NA
AETH MSN | | QP | NA | Tr | NA | NA | NA
AETH Syndrome | Acknowledge syndrome | Tr | NA | Tr | NA | NA | NA
AtomicAckETH | Atomic data returned; Passed to upper layers for each message. | NC | NA | NC | NA | NA | NA
Immediate data | Dependent on operation; Passed to upper layers for each message. | NC | NC | NC | NC | NA | NA
Payload | Dependent on operation; Passed to upper layers for each message. | NC | NC | NC | NC | NC | NC
ICRC | Checked by transport | Tr | Tr | Tr | Tr | NA | NA
VCRC | Checked by Link layer; data dependent | link | link | link | link | link | link

a. For HCAs, this information is provided to upper layers.

b. For QP1, the P_Key need only be a member of the port's Partition table; it is not checked against a QP index.



9.11 Static Rate Control

As the traffic load increases in a fabric, resource contention increases.
Congestion management is used to smooth operation, improve fabric
efficiency, improve effective bandwidth, and improve average packet
latency in the face of such contention.

There are a variety of mechanisms to address this problem. For this
version of IBA, only static rate control will be defined. Future versions
of the specification may provide definition of additional mechanisms.

Static rate control is the ability of an endnode to keep the rate of data
sourcing into the fabric below a pre-configured value.

IBA supports Static Rate control in CAs to reduce congestion caused by
a higher-speed CA injecting packets onto a path within a subnet at a rate
that exceeds the path or destination CA's ability to sink. For example,
consider a CA with a 10 Gbps link transmitting packets to a CA with a
2.5 Gbps link through an intermediate switch. In this case, the switch
would be required to introduce "back-pressure" (limit the link-level flow
control credits returned to the faster link) in order to prevent the
slower link from being overrun.

IBA provides three mechanisms to manage static rate control.

• Device provided port rate information (see 16.2.3.1 ClassPortInfo on
  page 756)

• FM supported reporting of best possible rates for a source/destination
  pair (see 15.2.5.16 PathRecord on page 686).

• CA "Inter Packet delay" parameters in the connection setup MADs
  (described below)

9.11.1 Static rate control for Heterogeneous Links

A channel adapter has the ability to limit the rate of packets injected.
This rate is based on the subnet-local destination port.

o9-163: If a port can support injection into the fabric at a rate greater
than 2.5 Gbits/sec, this port shall provide static rate control as defined
in this section.

The link rate supported is defined by the PortInfo:LinkWidthSupported
and PortInfo:LinkSpeedSupported attributes. See Table 127 PortInfo on
page 634 for a description of these.

o9-164: If a port is configured for injection into the fabric at a rate
greater than 2.5 Gbits/sec, it shall not schedule a packet for injection
into the local subnet until a programmable amount of time has passed since
the last packet was scheduled for injection from this source port to the
same destination port.

The link rate configured is defined by the PortInfo:LinkWidthActive and
PortInfo:LinkSpeedActive attributes. See Table 127 PortInfo on page 634
for a description of these.

In the above, destination port refers to the full DLID (i.e. base DLID plus
path bits) of the destination port within the local subnet, even for
globally routed packets, while source port refers to the ingress port
regardless of SLID (i.e. applies to all the SLIDs associated with this
port).

The time to wait before transmitting a subsequent packet is based on the
time it takes to transmit the current packet.

o9-165: If a port can support injection into the fabric at a rate greater
than 2.5 Gbits/sec, the time to wait between scheduling packets destined
for the same DLID and originating from the same port is determined by the
Inter Packet Delay (IPD). Specifically, if a packet b is to be sent to the
same DLID and using the same source port as packet a, then packet b
shall not be scheduled until a time Tg has passed since packet a was
scheduled, where Tg is calculated as: (IPD + 1) multiplied by the time it
takes to transmit the first packet. Further, the time it takes to transmit
a packet is calculated as LRH:PktLen*4/Lr where Lr is the port speed as
obtained from the PortInfo:LinkWidthActive and PortInfo:LinkSpeedActive
attributes.
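
The scheduling rule in o9-165 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, LRH:PktLen is taken to be a count of 4-byte words, and Lr is assumed to be expressed in bytes per second so that the result comes out in seconds.

```python
def transmit_time(pkt_len_words: int, link_rate: float) -> float:
    """Time to transmit one packet: LRH:PktLen * 4 / Lr.

    Assumption: pkt_len_words counts 4-byte words and link_rate (Lr)
    is in bytes per second, so the result is in seconds.
    """
    return pkt_len_words * 4 / link_rate


def min_schedule_gap(ipd: int, pkt_len_words: int, link_rate: float) -> float:
    """Minimum time between scheduling packet a and packet b.

    Implements (IPD + 1) multiplied by the transmit time of packet a.
    """
    if not 0 <= ipd <= 255:
        raise ValueError("IPD is an 8-bit value (0..255)")
    return (ipd + 1) * transmit_time(pkt_len_words, link_rate)


# Hypothetical example: a 256-byte packet (64 words) on a link that
# moves 250e6 bytes per second, with IPD = 3.
gap = min_schedule_gap(ipd=3, pkt_len_words=64, link_rate=250e6)
```

With IPD = 0 the gap equals the transmit time of the packet itself, which is why that setting suits matched links.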

21 

The Inter Packet Delay (IPD) value is an 8-bit integer and is interpreted as 22 
depicted in the table below. Note that all 256 possible values are legal. 23 

24 

Table 63 Inter Packet Delay 25 

IPD Multiplier rate Comment 

0 0 100% Suited for matchedlinks 28 

1 1.00 50% 29 

2 2.00 33% Suited for 30 Gbps to 10 30 

Gbps conversion 31 

3 3.00 25% Suited for 10 Gbps to 2.5 ^2 

Gbps conversion 33 

11 11.00 8% Suited for a 30 Gbps to 2.5 

Gbps conversion 35 
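
The rows of Table 63 are consistent with a simple relationship between the two link speeds and the IPD. The sketch below is an inference from the table rather than a rule stated in the text, and the function names are hypothetical.

```python
import math


def ipd_for(source_gbps: float, dest_gbps: float) -> int:
    """Smallest IPD that keeps a faster source at or below the slower rate.

    Assumption (inferred from Table 63): the injected rate is the source
    rate divided by (IPD + 1).
    """
    if source_gbps <= dest_gbps:
        return 0  # matched (or slower) source needs no extra delay
    return math.ceil(source_gbps / dest_gbps) - 1


def rate_percent(ipd: int) -> float:
    """Fraction of the source link rate actually injected, in percent."""
    return 100 / (ipd + 1)


# The three conversions listed in Table 63.
conversions = [ipd_for(30, 10), ipd_for(10, 2.5), ipd_for(30, 2.5)]
```

Rounding `rate_percent` to the nearest integer gives the Rate column of the table for each listed IPD.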

See 17.2.6 Static Rate Control on page 803 for which values of IPD CAs
are required to support.

C9-226: If a CA is requested to use an unsupported value, the CA shall
pick a supported value, and return that value in the appropriate MADs or
verbs.

Each connected QP (EE context for RDs) should have a programmed IPD
value. UDs should include the IPD in the WQE.

The same value of IPD should be programmed for every connected QP
and WQE that uses the same port and the same DLID. If different values
of IPD are programmed, the CA may use any of these values for any of
this traffic.

Chapter 10: Software Transport Interface 






10.1 Overview 



10.1.1 Introduction 



This chapter describes the software transport layer of the IBA. The soft- 
ware transport defines the capabilities and behavior of the Channel Inter- 
face (CI), the presentation of the channel to the Verbs Consumer. This 
interface is implemented as a combination of the Host Channel Adapter 
(HCA), its associated firmware, and host software. Specification of the 
API used by the Verbs Consumer to access the capabilities of the CI is 
outside of the scope of this architecture. 

A concept frequently encountered in this specification is that of Verbs
Consumer. The precise meaning of the phrase varies, as a function of
context, but it always means, as defined in the Glossary, the executing
entity employing the capabilities of the CI to accomplish some objective.
In some instances the Verbs Consumer may be an OS kernel thread, in
others a user-level application, and in still others, some special,
privileged process. Where the difference is important to the correct
behavior of an implementation, it is defined explicitly, as in 11.1 Verbs
Introduction and Overview on page 446; elsewhere, it is left unspecified.

While the Partitioning section is not strictly part of the software transport 
layer, it describes ideas that connect intimately with the semantics of the 
Queue Pair (QP), and are therefore reasonably elaborated in this chapter. 
In addition, giving the descriptions of the necessary entities here ensures 
their inclusion in the architecture specification. 



The CI is the locus of interaction between the consumer of IBA services 
and the instantiation of an IBA fabric. Access to the HCA is via Verbs, 
which enable creation and management of QPs, management of the 
HCA, and coping with error indications from the CI that may be surfaced 
via the Verbs. All these activities must be carried out so as to enable Verbs 
consumers to enjoy the same level of protection and security as are guar- 
anteed other entities supported by the host operating system. 

Fundamental to CI interaction is management of HCAs, which includes ar- 
ranging access to them, accessing and modifying selected of their at- 
tributes, and shutting them down. These activities are described below, 
and details of the corresponding Verbs layer semantics are given in the 
next chapter. 

Entities with central importance to CI operation are QPs. They must be
created, associated with protection domains, modified as required, and
destroyed to free up resources when no longer needed. In use, they
provide repositories for addressing information needed by Verbs consumers,
as well as protection information to guarantee the operational integrity
of themselves and the host system. QPs provide for various modes of
operation, depending upon the requirements of the consumer. Details of
these modes, as well as the means of establishing them for a QP are
described below, and corresponding Verbs semantics are detailed in the
next chapter. For a graphical depiction of the QP, see Figure 11 on
page 59.

As a central mode of QP operation, direct, protected access to consumer
memory is critical to realizing the performance potential of the IBA. This
chapter describes the semantics of memory access defined for the
architecture, detailing the ideas of memory regions and windows, and their
registration, access keys for local and remote access to registered memory,
and the management of errors that may arise in this context.

A Work Request (WR) is an elementary object in the software transport
layer, used by consumers to enqueue Work Queue Elements (WQEs) to
the Send and Receive queues of a QP. The WQE is what identifies the
individual events of communication over the IBA fabric. A graphical
depiction of the WQE and QP can be seen in Figure 12 on page 60. The five
varieties of WRs, and the dynamics of their creation, use, and disposition
via entries in Completion Queues (CQEs) are described in the sections to
follow, as are the disposition of errors that may arise as they are used.
Details of their contents are given in the next chapter.

10.2 Managing HCA Resources

10.2.1 HCA

Verbs allow the Consumer to open an HCA, retrieve HCA attributes,
modify HCA attributes that can be changed by the Consumer, and close
the HCA.

Queue Pairs, Completion Queues and other resources associated with a
specific HCA instance cannot be shared across multiple HCAs, even if
they are managed by the same device driver software.

The intent of the architecture is to allow an implementation to pass Work
Requests and Completion Status to and from a user space Consumer
process to the HCA without kernel involvement.

10.2.1.1 Opening an HCA

Opening an HCA prepares the HCA for use by the Consumer. Once
opened, an HCA cannot be opened again until after it has been closed.

The Verb used to open an HCA returns an opaque object or handle to
uniquely reference each HCA so that Consumers can distinguish between
HCAs in the endnode.

10.2.1.2 HCA Attributes

HCA Attributes are device characteristics. These attributes must be able
to be retrieved by the Consumer. The full list of HCA Attributes is defined
in 11.2.1.2 Query HCA on page 449.

10.2.1.3 Modifying HCA Attributes

Modification of a restricted set of HCA attributes is permitted. This is
primarily restricted to performance and error counter management
information. Most HCA Attributes are either fixed or manipulated through
the Fabric Management Interface or General Services Interface.

10.2.1.4 Closing an HCA

Close restores the HCA to its initialized condition, and deallocates any
resources allocated during the HCA open.

It is not the responsibility of the Channel Interface to track any resources
which were not allocated by the HCA open.

10.2.2 Addressing

10.2.2.1 Source Addressing

For global addressing, each HCA Source Port has a GID Table containing
the valid GIDs for the Source Port. The GID Table is obtained via the
Query HCA Verb.

C10-1: For each HCA Source Port, the CI shall maintain a GID Table
containing the valid GIDs for the Source Port.

Each Address Vector contains a Source GID Index. The Source GID
Index specifies an index into the Source GID Table. The entry referenced
by the Source GID Index defines the Source GID associated with the
Address Vector.

C10-2: For each GID Table, the first entry in the table shall contain the
read-only invariant value of the Base GUID.

For local addressing, IBA also provides for the assignment of multiple
LIDs to an HCA port through the LMC. The LMC specifies the number of
least significant bits of the LID that an HCA port masks (ignores) when
validating that a packet DLID matches its assigned LID.

C10-3: Each Address Vector shall contain specific Source Port LID Path
Bits.

The most significant bits of the LID combined with the Path Bits define the
Source Port LID associated with the Address Vector.
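
The LMC masking and the combination of base LID with Path Bits can be illustrated as follows. This is a minimal sketch, assuming 16-bit LIDs; the function names are hypothetical and are not part of any Verbs interface.

```python
LID_MASK = 0xFFFF  # assumption: LIDs are 16-bit values


def dlid_matches(dlid: int, base_lid: int, lmc: int) -> bool:
    """True if dlid matches the port's LID, ignoring the low `lmc` bits."""
    keep = ~((1 << lmc) - 1) & LID_MASK
    return (dlid & keep) == (base_lid & keep)


def source_lid(base_lid: int, lmc: int, path_bits: int) -> int:
    """Most significant bits of the LID combined with the Path Bits."""
    low = (1 << lmc) - 1
    if path_bits & ~low:
        raise ValueError("path bits must fit within the LMC-masked bits")
    return (base_lid & ~low & LID_MASK) | path_bits


# Hypothetical port: base LID 0x0010 with LMC = 2 owns LIDs 0x10..0x13.
owned = [lid for lid in range(0x0C, 0x18) if dlid_matches(lid, 0x0010, 2)]
```

With LMC = 2 the port answers to four consecutive LIDs, and each Address Vector selects one of them through its Path Bits.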
10.2.2.2 Destination Addressing

Addressing of destination endpoints is determined based on the Service
Type of the QP:

C10-4: For Connection-oriented QPs, the destination address shall be
stored in the QP Context, and shall be manipulated exclusively through
the Modify Queue Pair Attributes Verbs.

o10-1: If the CI supports the RD Service, then for Reliable Datagram QPs,
the destination address shall be stored in the EE Context, and shall be
manipulated exclusively through the Modify EE Context Attributes Verb,
and an EE Context shall be referenced by each individual Work Request
(see 10.2.6 End-to-End Contexts).

C10-5: For Unreliable Datagram QPs, the destination address of the node
shall be contained in an Address Handle, and an Address Handle shall
be referenced by each individual Work Request.

o10-2: If the CI supports Raw Datagram Service, then for Raw QPs, the
destination address shall be supplied via each individual Work Request.
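
The four rules above amount to a lookup from service type to the place where the destination address lives. The table below is only a summary sketch; the dictionary, the short service labels, and the helper name are illustrative and do not correspond to any Verb.

```python
# Where the destination address is held, per service type.
# "RC" and "UC" stand for the connection-oriented services.
DESTINATION_ADDRESS_SOURCE = {
    "RC": "QP Context",
    "UC": "QP Context",
    "RD": "EE Context",       # EE Context referenced by each Work Request
    "UD": "Address Handle",   # Address Handle referenced by each Work Request
    "RAW": "Work Request",    # supplied directly in each Work Request
}


def destination_address_source(service_type: str) -> str:
    """Return the object that holds the destination address."""
    return DESTINATION_ADDRESS_SOURCE[service_type.upper()]
```

For example, `destination_address_source("ud")` yields `"Address Handle"`.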



An Address Handle is a consumer-opaque object that refers to a local or
global destination. Verbs are used to create, modify and destroy Address
Handles. Address handles are associated with protection domains.
Protection domains are described in 10.2.2.3 Protection Domains.

C10-6: The CI shall support sending messages from a QP or EE
addressed to the same or a different QP/EE on the same port in the sending
HCA. Such messages shall not be transmitted through the fabric, but
shall remain contained within that HCA.

No special addressing mechanisms are necessary to accomplish this;
instead, the destination information in the source QP, EE, Address Vector,
or Work Request is the same as that which any other node on the fabric
would use to address the destination QP/EE.

10.2.2.3 Protection Domains

A Protection Domain (PD) is used to associate Queue Pairs with Memory
Regions and Memory Windows, as a means for enabling and controlling
HCA access to Host System memory. PDs are also used to associate
Unreliable datagram queue pairs with Address Handles, as a means of
controlling access to UD destinations. Queue Pairs are described in 10.2.3
Queue Pairs. Memory Regions and Memory Windows are described in
detail in Section 10.6 Memory Management. PDs are specific to each
HCA.

C10-7: Each Queue Pair in an HCA shall be associated with a single PD.
Multiple Queue Pairs shall be able to be associated with the same PD.

C10-8: Each Memory Region, Memory Window, or Address Handle shall
be associated with a single PD. Multiple Memory Regions, Memory
Windows, or Address Handles shall be able to be associated with the same
PD.

C10-9: Operations on a Queue Pair that access a Memory Region or a
Memory Window shall be allowed only if the Queue Pair's PD matches
the PD of the Memory Region or Memory Window.

C10-10: Operations on unreliable datagram queue pairs that access an
Address Handle shall be allowed only if the Queue Pair's PD matches the
PD of the Address Handle. If there is a mismatch, the Channel Interface
shall return either Invalid Address Handle as an immediate error or Local
Operation Error as a completion error.
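
The matching rule in C10-9 and C10-10 reduces to an equality test between two PD identities. The following is a minimal sketch; the `Resource` type, the integer PD identifiers, and the function name are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A Memory Region, Memory Window, or Address Handle (illustrative)."""
    kind: str
    pd: int  # hypothetical PD identifier


def access_allowed(qp_pd: int, resource: Resource) -> bool:
    """An operation is allowed only if the QP's PD matches the resource's PD."""
    return qp_pd == resource.pd


region = Resource(kind="memory_region", pd=7)
handle = Resource(kind="address_handle", pd=9)
```

A QP in PD 7 may use `region` but not `handle`; for the latter a real CI would report Invalid Address Handle or Local Operation Error, as C10-10 requires.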

10.2.2.4 Allocating a Protection Domain

Protection Domains are allocated through the Verbs.

A PD is required when creating a Queue Pair, registering a Memory Region,
allocating a Memory Window, or creating an Address Handle.

A PD has no IB architected attributes. Operating Systems are commonly
expected to enforce the policy that when a Verbs consumer creates a
Queue Pair, registers a Memory Region, allocates a Memory Window, or
allocates an Address Handle, it must specify a PD for association with the
IB resources owned by it (that is, that were allocated by it).

10.2.2.5 Deallocating a Protection Domain

Protection Domains are deallocated through the Verbs.

C10-11: A PD shall not be deallocated if it is still associated with any
Queue Pair, Memory Region, Memory Window, or Address Handle. If this
is attempted, the Verbs shall return an immediate error.

10.2.3 Queue Pairs

The Verb consumer uses a Verb to submit a Work Request (WR) to a
Send queue or a Receive queue. Associated Send and Receive queues
are collectively called a Queue Pair (QP); these QPs drive the channel interface.
A QP, which is a component of the channel interface, is not directly
accessible by the Verbs consumer and can only be manipulated
through the use of Verbs. See 10.8 Work Request Processing Model for
a description of the WR submission process.




10.2.3.1 Creating a Queue Pair

Queue Pairs are created through the Verbs.

When a QP is created, a complete set of initial attributes must be specified
by the Consumer. The attributes that need to be defined when the QP is
created are denoted in 11.2.3.1 Create Queue Pair on page 459.

The maximum number of Work Queue Entries (WQEs) the Consumer expects
to be outstanding on each work queue of the Queue Pair must be
specified when the QP is created. The actual number of entries is returned
through the Channel Interface for each work queue.

When setting the maximum number of outstanding work requests on a
work queue, the consumer must take into account that this number must
be large enough to encompass the number of work requests on that
queue that have not completed plus the number of completed work requests
for that queue that have not been freed through the associated CQ
(see 10.8.5.1 Freed Resource Count on page 425). Note for unsignaled
completions, the consumer cannot consider the work request completed
until the work request has been confirmed completed as per 10.8.6 Unsignaled
Completions on page 426.
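The sizing rule above amounts to a simple sum. The following sketch shows the arithmetic; the function name required_queue_depth and its parameter names are illustrative assumptions, not part of the Verbs.

```python
def required_queue_depth(not_completed: int, completed_not_freed: int) -> int:
    """Smallest safe 'maximum outstanding work requests' for one work queue.

    not_completed: work requests on the queue that have not completed.
    completed_not_freed: completed work requests not yet freed through the CQ.
    """
    if not_completed < 0 or completed_not_freed < 0:
        raise ValueError("counts must be non-negative")
    return not_completed + completed_not_freed
```

For example, a consumer that keeps up to 32 work requests in flight and may leave up to 8 completions unfreed would request at least required_queue_depth(32, 8), that is, 40 entries.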

10.2.3.2 Queue Pair Attributes

Queue Pairs have attributes that can be retrieved through the Query
Queue Pair Verb. The complete list of QP Attributes is described in
11.2.3.3 Query Queue Pair on page 467.

10.2.3.3 Modifying Queue Pair Attributes

Certain QP Attributes may be modified after the QP has been created.
The subset of QP Attributes which can be modified are defined in 11.2.3.2
Modify Queue Pair on page 460.

It is possible to modify the QP Attributes with Work Requests outstanding
on the QP. Any Work Requests outstanding on the specified QP may not
execute properly when the attributes are changed.

When setting the maximum number of outstanding work requests on a
work queue, the consumer must take into account that this number must
be large enough to encompass the number of work requests on that
queue that have not completed plus the number of completed work requests
for that queue that have not been freed through the associated CQ
(see 10.8.5.1 Freed Resource Count). Note for unsignaled completions,
the consumer cannot consider the work request completed until the work
request has been confirmed completed as per 10.8.6 Unsignaled Completions.

A CI may support the ability to modify the maximum number of outstanding
Work Requests on a QP. If it does so, it must be able to support
it while Work Requests are outstanding. In addition, it must support resizing
both work queues on every QP. If immediate errors are returned,
the work queue(s) must be in the same state as it was prior to the attempt
to resize the work queue(s). It is understood that this may adversely affect
performance, but it must not be the cause of immediate, completion or
asynchronous errors, with the exception of immediate errors returned by
the Resize Queue Pair Verb. Note that a resize operation may adversely
affect other QPs attempting to communicate with the QP during the resize
operation in the form of timeouts and retries. It may also result in the loss
of data in the form of dropped packets for unreliable service type QPs,

10.2.3.4 Destroying a Queue Pair

Queue Pairs are destroyed through the Channel Interface.

When a QP is destroyed, any outstanding Work Requests are no longer
considered to be in the scope of the Channel Interface. It is the responsibility
of the Consumer to be able to clean up any associated resources.

Destruction of a QP releases any resources allocated below the Channel
Interface on behalf of the QP. Outstanding Work Requests will not complete
after this Verb returns.

10.2.3.5 Special QPs

QPs designated as special are the SMI QP (QP0), GSI QP (QP1) and the
Raw IPv6 and Raw Ethertype QPs. These QP types are special because
they have different and more restrictive properties than QPs designed for
more general use.

Incoming messages to the SMI and GSI QPs may be consumed below the
Verbs by a subnet management agent (SMA). If a MAD is consumed
below the Verbs, the semantics must be consistent from the Verbs Consumer's
point of view.

C10-12: Any message processing performed below the Verbs, e.g., by a
SMA, must not disturb any WQEs and CQEs posted on behalf of the
Verbs Consumer.

C10-13: The CI shall only allow direct access to the SMI and GSI QPs by
privileged mode Consumers.

SMI and GSI QPs can share a completion queue, but neither can share
one with any QP that is not of the SMI or GSI type.
Multiple Raw IPv6 and Raw Ethertype QPs are allowed on a single HCA.
However, the demultiplexing algorithm that is applied to receiving messages
between QPs is outside the scope of this specification.

10.2.4 Q_Keys

A Queue Key (Q_Key) is a construct used in Datagram Service QPs to
validate a remote sender's right to access a local Receive Queue. They
are set through the Modify Queue Pair Verb, as well as within Work Requests.
The values used for Q_Keys are not managed below the Verbs.
Q_Keys are contained in the headers for IB Datagram packets.

Q_Keys have the following properties:

• A Q_Key must be established in the QP Context before Receive Descriptors
can be posted to a QP.

• The Q_Key is a modifier in the Post Send Request Verb.

C10-14: For UD Service type QPs, the Q_Key in the QP Context shall be
used to validate incoming packets. If the Q_Key does not match, the
packet shall be silently dropped.

o10-3: If the CI supports the RD Service, then for RD QPs, the Q_Key in
the QP Context shall be used to validate incoming packets. If the Q_Key
does not match, the packet shall be NAK'd. This NAK shall result in the
Send Queue at the remote node being placed into the appropriate error
state as per the state diagram.

C10-15: A Q_Key shall be a modifier in the Post Send Request Verb for
Datagram Service Type queue pairs. The Channel Interface shall examine
the Q_Key in the Work Request. If the high order bit of the Q_Key
is set, the outgoing packet shall contain the Q_Key from the QP Context.
If the high order bit of the Q_Key is not set, the outgoing packet shall contain
the Q_Key from the Work Request.

For the RD service class, Q_Key violation results in the source Send
Queue transitioning to the error state. The destination has the option to
support a Q_Key violation counter and trap. This counter may optionally
be queried and set through the Queue Pair Verbs.

Q_Keys are not enforced on Raw QPs.

10.2.5 Completion Queues

A CQ can be used to multiplex work completions from multiple work
queues across queue pairs on the same HCA.

C10-16: The CI shall support Completion Queues (CQ) as the notification
mechanism for Work Request completions. A CQ shall have zero or
more work queue associations. Any CQ shall be able to service send
queues, receive queues, or both. Work queues from multiple QPs shall
be able to be associated with a single CQ.

Completion Queues are created through the Channel Interface.

The maximum number of Completion Queue Entries (CQEs) the Consumer
expects to be outstanding must be specified when the CQ is created.
The actual number of entries is returned through the Channel
Interface. It is the responsibility of the Consumer to ensure that the maximum
number chosen is sufficient for the Consumer's operations; it must,
in any case, arrange to handle an error resulting from CQ overflow.

C10-17: Overflow of the CQ shall be detected and reported by the CI before
the next WC is retrieved from that CQ. This error must be reported
as an affiliated asynchronous error (see 11.6.3.2 Affiliated Asynchronous
Errors on page 514).
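The ordering requirement of C10-17 (overflow is reported before the next work completion is retrieved) can be sketched with a toy model. Everything here is an assumption made for illustration: the class names are hypothetical, the asynchronous error is modelled as a Python exception raised at poll time, and the overflowing completion is simply discarded.

```python
from collections import deque


class CompletionQueueOverflow(Exception):
    """Stands in for the affiliated asynchronous error of C10-17."""


class CompletionQueue:
    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._entries = deque()
        self._overflowed = False

    def add_completion(self, wc) -> None:
        """Called when a work request completes."""
        if len(self._entries) >= self.max_entries:
            # Assumption of this sketch: the extra completion is lost and
            # the CQ only remembers that an overflow happened.
            self._overflowed = True
            return
        self._entries.append(wc)

    def poll(self):
        """Return the next work completion, or None if the CQ is empty."""
        if self._overflowed:
            # Report the overflow before handing out another completion.
            raise CompletionQueueOverflow("CQ overflow detected")
        return self._entries.popleft() if self._entries else None
```

A consumer that sizes the CQ too small sees the overflow on its next poll rather than silently missing a completion.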

10.2.5.2 Completion Queue Attributes

The only Completion Queue attribute is the maximum number of entries
in the CQ. This attribute can be retrieved through the Query Completion
Queue Verb.

Note that no Verb is provided which retrieves a CQ's set of associated
Work Queues; the consumer is responsible for keeping track of this information,
if needed.

10.2.5.3 Modifying Completion Queue Attributes

The CQ must be able to be resized through the Channel Interface while
Work Requests are outstanding. It is understood that this may adversely
affect performance, but it must not be the cause of immediate or completion
errors, with the exception of immediate errors returned by the Resize
Completion Queue Verb.

10.2.5.4 Destroying a Completion Queue

Completion Queues are destroyed through the Channel Interface.

C10-18: If the Verb to destroy a CQ is invoked while Work Queues are still
associated with the CQ, the CI shall return an error and the CQ shall not
be destroyed.

Destruction of a CQ releases any resources allocated below the Channel
Interface on behalf of the CQ.
10.2.6 End-to-End Contexts

An End-to-End Context (EE Context) is a local HCA resource, used by the
local HCA to track messages transferred between itself and another HCA,
to support Reliable Datagram Service QPs. EE Contexts are established
in an HCA by the Consumer through the Verbs.

o10-4: If the CI supports RD Service, the CI shall support an EE Context
for use by the Consumer to provide the connection between two HCA
ports. Each local EE Context shall be connected to exactly one destination
EE Context.

The Consumer must establish a communication channel between a pair
of EE Contexts, one on each HCA, before RD messaging can begin between
the two HCAs. This communication channel must be established
using the normal connection style semantics described in Chapter 12,
Communication Services.

o10-5: If the CI supports RD Service, multiple connected EE Contexts
(RD channels) shall be allowed between HCA port pairs. These EE Contexts
are allowed to have either the same or different sets of attributes.

Any Work Requests outstanding on the specified EE Context may not
execute properly when the attributes are changed.

The Consumer submits RD Work Requests to an RD type QP's Send
Queue. The Work Request specifies the EE Context to use to perform the
actual message transfer. Work Requests may be submitted to a single RD
QP that specify different EE Contexts as long as the EE Context specified
is in the same RDD as the RD QP.

It is the responsibility of the Consumer to create, modify and destroy the
EE Context, to use the Communication Services to gather the information
necessary to transition the EE Context through the states as well as to fill
out the necessary attributes for use.

It is important to note that Verbs manipulating EE Contexts, such as
Create EE Context and Modify EE Context, use an EE Context handle,
but Connection Management and submission of Work Requests to the
Send Queue require the EE Context number. This number can be retrieved
through the Query EE Context Verb.

10.2.6.1 Creating an EE Context

EE Contexts are created through the Channel Interface.

When an EE Context is created, a complete set of initial attributes must
be specified by the Consumer. The attributes that need to be defined
when the EE Context is created are denoted in Section 11.2.6.1 Create
EE Context on page 474.

10.2.6.2 EE Context Attributes

EE Contexts have attributes that can be retrieved through the Query EE
Context Verb. The complete list of EE Context Attributes is described in
Section 11.2.6.3 Query EE Context on page 479.

10.2.6.3 Modifying EE Context Attributes

Certain EE Context Attributes may be modified after the EE Context has
been created. The subset of EE Context Attributes which can be modified
are defined in Section 11.2.6.2 Modify EE Context Attributes on page 475.

It is possible to modify the EE Context Attributes when Work Requests requiring
the EE Context are outstanding. Any outstanding WR which requires
the specified EE context may not execute properly when the
attributes are changed.

10.2.6.4 Destroying an EE Context

EE Contexts are destroyed through the Channel Interface.

When an EE Context is destroyed, any outstanding Work Requests which
depend on the EE Context are expected to complete with an appropriate
error.

Destruction of an EE Context releases any resources allocated below the
Channel Interface on behalf of the EE Context.

10.2.7 Reliable Datagram Domains

A Reliable Datagram Domain (RDD) is a means to associate Queue Pairs
with EE contexts.

o10-6: If the CI supports RD Service, each RD QP shall be associated
with only one RDD. Multiple RD QPs shall be able to be associated with
the same RDD.

o10-7: If the CI supports RD Service, each EE context shall be associated
with only one RDD. Multiple EE contexts shall be able to be associated
with the same RDD.

o10-8: If the CI supports RD Service, WRs which specify an EE Context
on an RD Queue Pair shall be allowed only if the RDD in the EE Context
matches the RDD in the QP. If the RDDs do not match, the initiator's work
request will complete with a local operation error, with no effect on the
destination's receive queue (see 11.6.2 Completion Return Status on
page 511 for the correct error code).

o10-9: If the CI supports RD Service, the CI shall support at least two 
RDDs. 

The purpose of defining the RDD construct is to ensure that it is possible 
reliably to separate user and kernel I/O RD traffic through an HCA. Note 
also that realizing that separation requires two EE contexts, as well. 



10.2.7.1 Allocating A Reliable Datagram Domain

Reliable datagram domains are allocated through the Channel Interface,
using a privileged operation.

10.2.7.2 Deallocating A Reliable Datagram Domain

Reliable datagram domains are deallocated through the Channel Interface.

o10-10: If the CI supports RD Service, an RDD shall not be deallocated
if it is still associated with any Queue Pair or EE Context. If this is attempted,
the CI shall return an immediate error.



10.2.8 InfiniBand Header Data and Sources 



The following table lists each of the data items present in an IBA protocol 
header, as well as some internal state needed to send packets. For each 
Transport Service Type, it lists the source of that data or state. The 
sources for each of these are accessed through the Verbs. This table is 
provided to establish the correlation between the packet fields and the 
constructs established through the Verbs. Note that the legend for Table
64 is Table 65, below.



Table 64 Packet Fields, Queue Parameters, and their Sources

| Header | Field | RC UC | RD | UD | Raw IP | Raw ET |
| LRH | Virtual lane | Computed from SL and CAP SL->VL table |
| LRH | LRH version | Fixed |
| LRH | Service level | QP | EE | AV | WR |
| LRH | LRH next header - IBA transport bit | Fixed=1 | Fixed=0 |
| LRH | LRH next header - GRH bit | QP | EE | AV | Fixed=1 | Fixed=0 |
| LRH | Destination local identifier | QP | EE | AV | WR |






Table 64 Packet Fields, Queue Parameters, and their Sources (Continued)

| Header | Field | RC UC | RD | UD | Raw IP | Raw ET |
| LRH | Packet length | Computed from data/header length |
| LRH | Source local identifier (part not covered by LMC) | CAP |
| LRH | Source local identifier (part covered by LMC) | QP | EE | AV | WR |
| LRH | Reserved | Fixed=0 |
| GRH | IP version | Fixed=6 | DS | N/A |
| GRH | Traffic class | QP | EE | AV | DS | N/A |
| GRH | Flow label | QP | EE | AV | DS | N/A |
| GRH | Payload length | Computed from data/header length | DS | N/A |
| GRH | Next header | Fixed | DS (hw disallows IB transport value here) | N/A |
| GRH | Hop limit | QP | EE | AV | DS | N/A |
| GRH | Source GID | Computed from CAP table and index in QP | Computed from CAP table and index in EE | Computed from CAP table and index in AV | DS | N/A |
| GRH | Destination GID | QP | EE | AV | DS | N/A |
| BTH | Opcode | WR | N/A |
| BTH | BTH version | Fixed=0 | N/A |
| BTH | Partition key | QP | EE | QP | N/A |
| BTH | Destination queue pair | QP | WR | N/A |
| BTH | Pad count | Computed from data & header length | N/A |
| BTH | Solicited event | WR | N/A |
| BTH | Packet sequence number | Computed from QP state | Computed from EE state | Computed from QP state | N/A |
| BTH | Reserved | Fixed=0 | N/A |
Table 64 Packet Fields, Queue Parameters, and their Sources (Continued)

| Header | Field | RC | UC | RD | UD | Raw IP | Raw ET |
| RDETH | Remote EE context | N/A | EE | N/A |
| RDETH | Reserved | N/A | Fixed=0 | N/A |
| DETH | Queue key | N/A | WR or QP depending on WR | N/A |
| DETH | Source queue pair | N/A | QP | N/A |
| DETH | Reserved | Fixed=0 | N/A |
| RETH | Virtual address | WR | N/A |
| RETH | R_Key | WR | N/A |
| RETH | DMA length | WR | N/A |
| AtomicETH | Virtual address | WR | N/A | WR | N/A |
| AtomicETH | R_Key | WR | N/A | WR | N/A |
| AtomicETH | Swap data | WR | N/A | WR | N/A |
| AtomicETH | Compare data | WR | N/A | WR | N/A |
| AETH | Message sequence number | Computed | N/A | Computed | N/A |
| AETH | Syndrome | Computed | N/A | Computed | N/A |
| RWH | Reserved | N/A | Fixed |
| RWH | EtherType | N/A | WR |
| AtomicAckETH | Original remote data | Memory | N/A | Memory | N/A |
|  | Local EE context | N/A | WR | N/A |
|  | Port number | QP | EE | QP |
|  | Transport Timeout | QP | N/A | EE | N/A |
|  | Retry count | QP | N/A | EE | N/A |
|  | RNR retry count | QP | N/A | EE | N/A |
|  | MTU | QP | EE | N/A |
|  | Maximum static rate | QP | EE | AV | WR |
|  | Protection domain | QP |
|  | Reliable datagram domain | N/A | QP/EE | N/A |
|  | Send PSN | QP | EE | N/A |
|  | Receive PSN | QP | EE | N/A |
Table 64 Packet Fields, Queue Parameters, and their Sources (Continued)

| Header | Field | RC | UC | RD | UD | Raw IP | Raw ET |
|  | Outstanding atomics/RDMA reads supported at destination | QP | N/A | EE | N/A |
|  | Send CQ | QP |
|  | Receive CQ | QP |



The following table is the legend for the previous one. 

Table 65 Legend for Table 64

| Abbreviation | Meaning |
| AETH | Acknowledgement extended transport header |
| AtomicAckETH | Atomic acknowledgement extended transport header |
| AtomicETH | Atomic extended transport header |
| AV | Part of the address vector object defined by the Verbs |
| BTH | Base transport header |
| CA | Property of the channel adapter |
| CAP | Property of the channel adapter port |
| Computed... | Calculated from other values as specified |
| DETH | Datagram extended transport header |
| DS | Field taken from data segment pointed to by work request |
| EE | Taken from the EE context (RD service only) |
| Fixed | Value is determined by the specification and is the same in all packets. |
| GRH | Global routing header |
| LRH | Local routing header |
| Memory | Retrieved from host memory by CA |
| MTU | Maximum transfer unit |
| N/A | Not applicable to this Service Type |
| QP | Taken from Queue Pair state (the real QP in the case of RD service) |
| Raw | Raw Packet service |
| RC | Reliable Connected service |
| RD | Reliable Datagram service |
Table 65 Legend for Table 64

| Abbreviation | Meaning |
| RDETH | Reliable datagram extended transport header |
| RNR | Receiver not ready |
| RWH | Raw ethertype header |
| UC | Unreliable Connected service |
| UD | Unreliable Datagram service |
| WR | Taken from a Work Request |



10.3 Resource States 

10.3.1 Queue Pair and EE Context States 

This section contains the QP and EE Context state diagram. The same
state diagram is used for both QPs and EE Contexts. This section will use
the term QP/EE for this and, where differences are important, will note
them. This section will contain a definition of the QP/EE states. The
QP/EE states defined here and the transition order between the states are
shown in Figure 115. EE Contexts are created only for the Reliable Datagram
Service Type, whereas QPs are used for all Transport Service
Types.

Note that while QPs and EE Contexts share the same state diagram, the
EE Context state has no relationship to the states of the sending and receiving
RD QPs using that EE Context. The reader should not assume
that because a QP made a state transition an EE Context associated
with that RD QP will also transition, and vice versa.

Even though a subset of the states could occur in any order for some of
the Transport Service Types, the states must transition in the order specified.
This is to keep the state definitions consistent and error semantics
simplified. The order chosen is that required to support the Reliable Connection
Service Type, and to provide for completeness of the information
needed to transfer data using an EE Context.



[Figure 115: QP/EE Context state diagram. Transition labels recoverable from the figure: Create QP/EE; Modify QP/EE; Processing Error (dependent on QP type); Receive WR Completion Error or Async Error.]

Notes:

Destroy QP/EE Context can be called from any state and exits
the state diagram.

It is possible to transition from any state to the Reset state with the
Modify QP/EE Verb.

An error can be forced from any state with the Modify QP/EE Verb.

State annotations, in the order they appear in the figure:

QP: Can post Recv WRs.
EE: Can be connected with another EE Context.

QP: Can post & process Recv WRs & Send ACKs.
EE: EE can be used to process incoming messages on RDs.

QP: Can post & process Recv & Send WRs.
EE: EE can be used for outgoing WRs on RDs & process incoming
messages on RDs.

QP: Can post & process Recv WRs & Send ACKs. Can post Send WRs.
EE: EE can be used to process incoming messages on RDs.

Figure 115 QP/EE Context State Diagram



It is important to understand that the QP/EE states are intended to be
used as an integral part of the connection process. A thorough understanding
of the connection process and the connection state diagram is
assumed. This is discussed in detail in 12.9.7.1 Active States on page 547
and 12.9.7.2 Passive States on page 549.
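The transitions that this section states in words (the figure notes and the "transition into this state is possible only from" statements in the state descriptions) can be collected into a small lookup. This is a sketch only: the names State and modify_allowed are hypothetical, the SQEr state is left out because its transitions are not described in the text here, and Figure 115 remains the authoritative source.

```python
from enum import Enum


class State(Enum):
    RESET = "Reset"
    INIT = "Init"
    RTR = "RTR"
    RTS = "RTS"
    SQD = "SQD"
    ERROR = "Error"


# Forward transitions named in the state descriptions of this section.
_FORWARD = {
    (State.RESET, State.INIT),  # Init is entered only from Reset
    (State.INIT, State.RTR),    # RTR is entered only from Init
    (State.RTR, State.RTS),     # RTS is entered only from RTR or SQD
    (State.SQD, State.RTS),
    (State.RTS, State.SQD),     # SQD is entered only from RTS
}


def modify_allowed(current: State, target: State) -> bool:
    """Return True if a Modify QP/EE from current to target is permitted."""
    if target in (State.RESET, State.ERROR):
        # Per the figure notes, Reset and Error are reachable from any state.
        return True
    return (current, target) in _FORWARD
```

For instance, modify_allowed(State.INIT, State.RTS) is False because the RTR state cannot be skipped, while modify_allowed(State.SQD, State.RTS) is True.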
C10-19: The CI shall implement the Reset, Init, RTR, RTS, SQD, SQEr,
and Error states for each QP. Transitions between those states must be
restricted to those shown in Figure 115.

o10-11: If the CI supports RD Service, the CI shall implement the Reset,
Init, RTR, RTS, SQD, and Error states for each EE Context. Transitions
between those states must be restricted to those shown in Figure 115.

10.3.1.1 Reset

The characteristics of the Reset state are:

C10-20: A newly created QP/EE shall be placed in the Reset state.

• It is possible to transition to the Reset state from any other state
by specifying the Reset state when modifying the QP/EE attributes.

• Any resources required to implement the QP/EE have been allocated.
For example, some implementations require WQEs and/or
control structures to be allocated.

• The Modify Queue Pair and Modify EE Context Attributes Verbs
are the only way for the Verbs Consumer to cause a transition out
of the Reset State, without destroying the EE/QP.

C10-21: Upon creation, or transition to the Reset state, all QP/EE attributes
must be set to the initialization defaults, as documented in the
Create Queue Pair and Create EE Context Verbs.

• Transition out of the Reset state can be effected by calling the
Destroy Queue Pair or Destroy EE Context Verbs, thus exiting
the state diagram.

For EEs:

• It is an error for a Work Request to specify an EE Context in the
Reset state.

o10-12: If the CI supports RD Service, and a Work Request is submitted
to the Send Queue of an RD QP specifying an EE Context in the Reset
state, the Work Request shall be completed in error.

• No work requests can be outstanding which use an EE Context in
this state.

o10-13: If the CI supports RD Service, any incoming messages which
target an EE in the Reset state must be silently dropped.

For QPs:

C10-22: If a Work Request is submitted to a Work Queue while its corresponding
QP is in the Reset State, an immediate error shall be returned.

• The Work Queues are empty. No Work Requests are outstanding
on the work queues.

• All Work Queue processing is disabled.

C10-23: Incoming messages which target a QP in the Reset state must
be silently dropped.

10.3.1.2 Initialized (Init)



The characteristics for the Initialized state are:

• The basic QP/EE attributes have been configured as defined in
Modify Queue Pair and Modify EE Context Attributes Verbs.

• Transition into this state is only possible from the Reset state.

• The Modify Queue Pair or Modify EE Context Attributes Verbs are
the only way for the Verbs Consumer to cause a transition out of
the Init state, without destroying the EE/QP.

• Transition out of the Init state can be effected by calling the Destroy
Queue Pair or Destroy EE Context Verbs, thus exiting the
state diagram.

For EEs:

o10-14: If the CI supports RD Service, any incoming messages which
target an EE in the Init state must be silently dropped.

• It is an error for a Work Request to specify an EE Context in the
Init state.

o10-15: If the CI supports RD Service, and a Work Request is submitted
by the Consumer to the Send Queue of an RD QP specifying an EE Context
in the Init state, the Work Request shall be completed in error.

For QPs:

• Work Requests may be submitted to the Receive Queue but incoming
messages are not processed.

C10-24: The CI shall allow Work Requests to be submitted to a Receive
Queue while its corresponding QP is in the Init State.

• It is an error to submit Work Requests to the Send Queue.

C10-25: If a Work Request is submitted to a Send Queue while its corresponding
QP is in the Init State, an immediate error shall be returned.

• Work Queue processing on both queues is disabled.

C10-26: Incoming messages which target a QP in the Init state must be
silently dropped.

10.3.1.3 Ready to Receive (RTR)

The characteristics for the Ready to Receive state are:

C10-27: The CI shall support posting Work Requests to Receive Queue
of a QP in the RTR state.

C10-28: Incoming messages targeted at a QP in the RTR state shall be
processed normally.

o10-16: If the CI supports RD Service, and an incoming message is addressed
to an EE Context in the RTR state, the message shall be processed
normally.

• Transition into this state is possible only from the Init state, using
the Modify Queue Pair or Modify EE Context Attributes Verbs.

• Transition out of the RTR state can be effected by calling the Destroy
QP or Destroy EE Context Verbs, thus exiting the state diagram.

For EEs:

• It is an error for a Work Request to specify an EE Context in the
RTR state.

o10-17: If the CI supports RD Service, and a Work Request is submitted
by the Consumer to the Send Queue of an RD QP specifying an EE Context
in the RTR state, the Work Request shall be completed in error.

For QPs:

• Work Queue processing on the Send Queue is disabled. It is an
error to post Work Requests to the Send Queue.

C10-29: If a Work Request is submitted to a Send Queue while its corresponding
QP is in the RTR State, an immediate error shall be returned.

10.3.1.4 Ready to Send (RTS)

The characteristics for the Ready to Send state are:

• Before transitioning to this state, the QP/EE communication establishment
protocol must be completed.

• The channel between the requester's QP/EE and responder's
QP/EE has been established for connected Service Types and
RD channels.

• Transition into this state is possible only from the RTR and SQD
states.

• The Modify Queue Pair or Modify EE Context Attributes Verbs are
the only way for the Verbs Consumer to cause a transition out of
the RTS state, without destroying the EE/QP.

• Transition out of the RTS state can be effected by calling the Destroy
Queue Pair or Destroy EE Context Verbs, thus exiting the
state diagram.

C10-30: The CI shall support posting Work Requests to a QP in the RTS
state.

C10-32: Incoming messages targeted at a QP in the RTS state shall be
processed normally.

o10-18: If the CI supports RD Service, and an incoming or outgoing message
utilizes an EE Context in the RTS state, the message shall be processed
normally.

10.3.1.5 Send Queue Drain (SQD)

The characteristics for the Send Queue Drain state are:

C10-33: The CI shall support posting Work Requests to a QP in the SQD
state.

C10-34: Incoming messages targeted at a QP in the SQD state shall be
processed normally.

o10-19: If the CI supports RD Service, and an incoming message utilizes
an EE Context in the SQD state, the message shall be processed
normally.

• Transition into this state is possible only from the RTS state,
using the Modify Queue Pair or Modify EE Context Attributes Verbs.

C10-35: When transitioning into the SQD state, the QP/EE's send logic
must cease processing any additional messages. It must also complete
any outstanding messages on a message boundary, and process any
incoming acknowledgements. The CI must not begin processing additional
messages which had not begun execution when the state transition
occurred.

C10-36: When all expected acknowledgements have been received, and
processing of send queue work requests has ceased, and if event
notification has been requested, an Affiliated Asynchronous Event shall
be generated.

• The consumer can use the asynchronous event to determine
when a state transition is possible.

• It is possible to enter the RTS state or error states from the SQD
state via Modify Queue Pair or Modify EE Context Attributes
Verbs.

• Attributes may be modified during the transition from SQD to
RTS, but both sides must have received the affiliated asynchronous
event in order to safely change attributes.

• It is also possible to transition out of the SQD state by calling the
Destroy Queue Pair or Destroy EE Context Verbs, thus exiting
the state diagram.

For EEs:

• Work Queue processing on the Send side of the EE Context is
disabled.

O10-20: If the CI supports RD Service, Work Requests submitted to the
Send Queue of an RD QP, which specify an EE Context in the SQD state,
must not be processed but shall remain enqueued.

• QPs associated with an EE do not transition to the SQD state
automatically, nor is it inherently necessary they do so.

For QPs:

• Work Queue processing on the Send Queue is disabled.

C10-37: Work Requests submitted to the Send Queue of a QP in the SQD
state must not be processed but shall remain enqueued.
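
The Send Queue behaviour described for the RTR, RTS and SQD states can
be illustrated with a small model. This is a minimal sketch, not a
definitive implementation: the names SendQueueModel, ImmediateError,
post_send and modify_state are hypothetical, the model does not check
that a requested state change is legal, and "processing" a Work Request
is reduced to moving it between two lists.

```python
from collections import deque
from enum import Enum


class QPState(Enum):
    RTR = "RTR"
    RTS = "RTS"
    SQD = "SQD"


class ImmediateError(Exception):
    """Hypothetical stand-in for an immediate error returned by a Verb."""


class SendQueueModel:
    """Toy model of Send Queue behaviour in the RTR, RTS and SQD states."""

    def __init__(self, state):
        self.state = state
        self.pending = deque()  # posted but not yet processed
        self.processed = []     # handed to the (imaginary) send logic

    def post_send(self, work_request):
        if self.state is QPState.RTR:
            # C10-29: posting to the Send Queue in RTR is an immediate error.
            raise ImmediateError("Send Queue is disabled in the RTR state")
        # C10-30 and C10-33: posting is supported in both RTS and SQD.
        self.pending.append(work_request)
        self._process()

    def modify_state(self, new_state):
        # Assumption: the caller only requests transitions the text allows.
        self.state = new_state
        self._process()

    def _process(self):
        # C10-37: in SQD, Work Requests stay enqueued and are not processed.
        while self.state is QPState.RTS and self.pending:
            self.processed.append(self.pending.popleft())
```

In this model a Work Request posted while the QP is in SQD stays in
pending and is only processed once the state is changed back to RTS.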

10.3.1.6 Send Queue Error (SQEr)

The characteristics for the Send Queue Error state are:

• Transition into this state can only happen as the result of a
Completion Error, which occurred during the processing of a Work
Request on the Send Queue while in the RTS state.

C10-38: Receive Work Requests which were submitted to a Receive
Queue prior to that queue's transition into the SQEr state shall
continue to be processed normally. New Receives must be able to be
posted to such a Receive Queue.

C10-39: A Work Request which caused the Completion Error leading to
the transition into the SQEr state must return the correct Completion
Error Code for the error through the Completion Queue.
• This WR may have been partially or fully executed, and thus may
have affected the state of the receiver, as follows:

  Send operations may have been partially or fully completed;
  because of this, a completion queue entry may or may not have been
  generated on the receiver.

  RDMA Read operations may have been partially completed;
  because of this, the contents of the memory locations pointed to by
  the data segments of their Work Requests are indeterminate.

  RDMA Write operations may have been partially completed;
  because of this, the contents of the memory locations pointed to by
  the remote address of their Work Requests are indeterminate. If
  the operation specified Immediate Data, a completion queue entry
  may or may not have been generated on the receiver.

  Atomic operations may, or may not have been attempted; because
  of this, the contents of the memory locations pointed to by the
  remote address of the Work Request may have a value consistent
  with either event. At the local node, the contents of the memory
  locations pointed to by the data segments of their Work Requests
  are indeterminate.

C10-40: Work Requests on the Send Queue, subsequent to that which
caused the Completion Error leading to the transition into the SQEr
state, must return the Flush Error completion status through the
Completion Queue.

• Depending on the Service Type of the QP, some of the subsequent
WRs may have been in progress when the error occurred.
This may have affected state on the remote node. The possible
effects depend on the WR type as noted above.

• The Modify Queue Pair Verb can be used to transition from the
SQEr state to the RTS state.

• The Modify Queue Pair Verb can be used to transition from the
SQEr state to the Reset or the Error state.

• A Receive Queue Error or an Asynchronous Error will result in a
transition to the Error State.

• Transition out of the SQEr state can be effected by calling the
Destroy Queue Pair Verb, thus exiting the state diagram.

10.3.1.7 Error

The characteristics for the Error state are:

• Normal processing on the QP/EE is stopped.

C10-41: A Work Request which caused the Completion Error leading to
the transition into the Error state must return the correct Completion
Error Code for the error through the Completion Queue.
• This WR may have been partially or fully executed, and thus may
have affected the state of the receiver, as follows:

  Send operations may have been partially or fully completed;
  because of this, a completion queue entry may or may not have been
  generated on the receiver.

  RDMA Read operations may have been partially completed;
  because of this, the contents of the memory locations pointed to by
  the data segments of their Work Requests are indeterminate.

  RDMA Write operations may have been partially completed;
  because of this, the contents of the memory locations pointed to by
  the remote address of their Work Requests are indeterminate. If
  the operation specified Immediate Data, a completion queue entry
  may or may not have been generated on the receiver.

  Atomic operations may, or may not have been attempted; because
  of this, the contents of the memory locations pointed to by the
  remote address of the Work Request may have a value consistent
  with either event. At the local node, the contents of the memory
  locations pointed to by the data segments of their Work Requests
  are indeterminate.

C10-42: Work Requests subsequent to that which caused the Completion
Error leading to the transition into the Error state, including those
submitted after the transition, must return the Flush Error completion
status through the Completion Queue.

• Depending on the Service Type of the QP, some of the subsequent
WRs may have been in progress when the error occurred.
This may have affected state on the remote node. The possible
effects depend on the WR type as noted above.

• The Modify Queue Pair or Modify EE Context Attributes Verbs,
specifying a transition to the Reset state, are the only means to
effect a transition from the Error state to the Reset state.

• Transition out of the Error state can also be effected by calling the
Destroy Queue Pair or Destroy EE Context Verbs, thus exiting
the state diagram.

For EEs:

• If a Work Request is in process when the error occurred, the
Work Request is completed with a completion error.

o10-21: If the CI supports RD Service, and an RD Work Request uses an
EE context which is in the error state, that WR must be completed in
error. This shall place the Sending QP into the SQEr state.

• Errors that occur on an EE may not have a corresponding effect
on the QP state.
For QPs:

• For Affiliated Asynchronous Errors, it may not be possible to
continue to process Work Requests. In this case, outstanding Work
Requests will not be completed.

• When handling the error notification, it is the responsibility of the
Consumer to ensure that all error processing has completed prior
to forcing the QP to reset.
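
The completion rules for the Error state (C10-41 and C10-42) can be
sketched as follows. This is an illustrative model only: the Status enum
and the completions_after_error function are hypothetical names, Work
Requests that precede the failing one are assumed to have completed
successfully, and the partial-execution effects listed above are not
modelled.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    COMPLETION_ERROR = "completion error"
    FLUSH_ERROR = "flush error"


def completions_after_error(work_requests, failing_index):
    """Return (work_request, status) pairs in posting order.

    The Work Request at failing_index reports its Completion Error
    (C10-41); every later Work Request reports Flush Error (C10-42).
    """
    if not 0 <= failing_index < len(work_requests):
        raise IndexError("failing_index is out of range")
    completions = []
    for index, work_request in enumerate(work_requests):
        if index < failing_index:
            status = Status.SUCCESS  # assumption: earlier WRs succeeded
        elif index == failing_index:
            status = Status.COMPLETION_ERROR
        else:
            status = Status.FLUSH_ERROR
        completions.append((work_request, status))
    return completions
```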

10.4 Automatic Path Migration

Automatic Path Migration is an optional facility that enables
connection recovery in the case of failures. Automatic path migration is
available for Reliable and Unreliable Connected QP Service Types and
Reliable Datagram EE Contexts.

This section explains Automatic Path Migration from the software
transport perspective. A hardware-centric description is contained in
the Channel Adapter section, 17.2.8 Automatic Path Migration on page 804.

The Modify Queue Pair and Modify EE Context Attributes Verbs provide
the basic capability to load an alternate path and to transition the
path migration states defined in 10.4.1 Path Migration State Diagram.

Automatic path migration is enabled or re-enabled by loading an
alternate path on the pair of connected QP or EE Contexts and setting
the path migration state to Rearm. The Communication Manager defines
protocols and mechanisms, which may be used to enable or re-enable
Automatic Path Migration on both the local and the remote, connected QP
or EE Context. The Communication Manager support for Automatic Path
Migration is described in 12.6 Communication Management Messages on
page 519 and in 12.8 Alternate Path Management on page 539.

Once Automatic Path Migration has been enabled on both ends of a
connected QP/EE, it is possible for the migration to be initiated by
transitioning the QP/EE path migration state from Armed to Migrated
either from above or below the Verbs interface. The policy used by the
Verbs Consumer or the CI to determine when a path migration should be
attempted is outside the scope of the architecture.

O10-22: If Automatic Path Migration is supported, the CI shall implement
the Migrated, Rearm, and Armed path migration states for each Reliable
Connected and Unreliable Connected queue pair. Transitions between
those path migration states must be restricted to those shown in Figure
116.



O10-23: If Automatic Path Migration and Reliable Datagram service are 
supported, the CI shall implement the Migrated, Rearm, and Armed path 
migration states for each EE Context. Transitions between those path mi- 
gration states must be restricted to those shown in Figure 116. 

The path migration states apply to a QP or EE Context, but are only tan- 
gentially related to the QP/EE Context states described in 10.3.1 Queue 
Pair and EE Context States . 

o10-24: If Automatic Path Migration is supported, and the Verbs
Consumer attempts to change the path migration state from Migrated to
Rearm during a transition to a QP/EE state other than RTS, an immediate
error shall be returned.

O10-25: If Automatic Path Migration is supported, and the Verbs Con- 
sumer attempts to change the path migration state from Armed to Mi- 
grated during a transition from a QP/EE state other than RTS or SQD, an 
immediate error shall be returned. 

The relationship of the path migration states to the communication
establishment process is defined in 12.9.7 State and Transition
Definitions on page 547.

The path migration states are shown in Figure 116. 



Figure 116 Path Migration State Diagram
[Diagram not reproduced. Transition labels: Create QP/EE; Modify QP/EE;
CI causes transition; CI causes transition on local and remote nodes;
Ready for migration.]



10.4.1.1 Migrated 



O10-26: If Automatic Path Migration is supported, the initial path migration 
state for a QP/EE shall be Migrated. 

The Automatic Path Migration capability is suppressed while the state is 
set to Migrated. 

The Verbs Consumer should leave the path migration state for the QP/EE
set to Migrated under the following circumstances:

• The local CI does not support Automatic Path Migration. If the
Verbs Consumer attempts to change the path migration state using
the Modify Queue Pair or Modify EE Context Attributes Verbs,
an immediate error will be returned.

• The Verbs Consumer does not wish to enable Automatic Path
Migration on the QP/EE pair.

• The remote CI does not support or desire Automatic Path
Migration. If the Verbs Consumer changes the path migration state to
Armed using the Modify Queue Pair or Modify EE Context
Attributes Verbs, the path migration state for the QP/EE is changed
accordingly and no errors are generated. The local CI shall not
transition the QP/EE from Rearm to Armed. Handling this
condition is outside of the scope of the architecture.

The Verbs Consumer or the CI may set the path migration state to Mi- 
grated when the current path migration state is Armed and the QP/EE 
state is RTS. The decision of when to migrate is a matter of policy, which 
is outside the scope of the architecture. 

O10-27: If Automatic Path Migration is supported, a transition from Armed 
to Migrated shall result in a migration to the alternate path on the local 
QP/EE. The CI shall raise the Path Migrated affiliated asynchronous 
event and shall send the next data packet using this QP/EE on the new 
path with a migration request. 

The remote, connected QP/EE validates this request as defined in section 
17.2.8 Automatic Path Migration on page 804 . 

O10-28: If Automatic Path Migration is supported, upon successfully vali- 
dating an incoming packet's migration request, the CI shall set the path 
migration state for that QP/EE to Migrated, shall migrate to the alternate 
path, and shall also raise the Path Migrated affiliated asynchronous event 
for that QP/EE. 
O10-29: If Automatic Path Migration is supported, upon failing to validate
an incoming packet's migration request, the CI shall not modify the path
migration state for that QP/EE, shall not migrate to the alternate path, but
shall raise the Path Migration Request Failed affiliated asynchronous
error for that QP/EE.

The Verbs Consumer should only set the path migration state to Migrated
when the current path migration state is Armed and the QP/EE state is
RTS. The Modify Queue Pair or Modify EE Context Attributes Verbs shall
generate an immediate error when the Verbs Consumer attempts to set
the path migration state to Migrated under any other condition.

O10-30: If Automatic Path Migration is supported, the CI shall only
change the local path migration state to Migrated when the current state
is Armed and the QP/EE state is RTS.

10.4.1.2 Rearm

Only the Verbs Consumer is allowed to initiate the transition from Migrated 
to Rearm using the Modify Queue Pair or Modify EE Context Attributes 
Verbs. 

O10-31: If Automatic Path Migration is supported, the CI shall not change 
the local path migration state from Migrated to Rearm except at the re- 
quest of the Verbs Consumer. 

The Verbs Consumer should load or reload the alternate path and ensure 
the remote node has accepted the alternate path prior to transitioning the 
state from Migrated to Rearm. A transition to the Rearm state indicates to
the CI that the Verbs Consumer believes this QP/EE is ready to be transi- 
tioned to the Armed state. An invalid or stale alternate path will not gen- 
erate any errors when the Verbs Consumer transitions the state to Rearm. 
Handling this condition is outside the scope of the architecture. 

o10-32: If Automatic Path Migration is supported, a transition to the 
Rearm state shall cause the CI to attempt to coordinate with the remote, 
connected QP/EE to move both the local and the remote connected 
QP/EE into the Armed state in a lock-step manner. 

The details regarding how the CIs perform this transition are contained in
17.2.8 Automatic Path Migration on page 804.

The QP/EEs at both ends of the connection must be in the Rearm state 
before the CI can transition them to the Armed state. 
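
The path migration transitions described in this section can be
summarised in a small model. This is a sketch under stated assumptions,
not the normative state diagram: it covers only the transitions spelled
out in the text (Figure 116 remains the authority), it treats the Rearm
to Armed step as something only the CI performs, and the names
PathState, Actor, TransitionError and next_path_state are hypothetical.

```python
from enum import Enum


class PathState(Enum):
    MIGRATED = "Migrated"
    REARM = "Rearm"
    ARMED = "Armed"


class Actor(Enum):
    CONSUMER = "Verbs Consumer"
    CI = "Channel Interface"


class TransitionError(Exception):
    """Hypothetical error type for a refused path migration transition."""


INITIAL_PATH_STATE = PathState.MIGRATED  # O10-26


def next_path_state(current, requested, actor, qp_state, remote_in_rearm=False):
    """Return the new path migration state or raise TransitionError."""
    if current is PathState.MIGRATED and requested is PathState.REARM:
        # O10-31: only the Verbs Consumer may request Migrated -> Rearm.
        if actor is not Actor.CONSUMER:
            raise TransitionError("only the Verbs Consumer may request Rearm")
        # o10-24: refused unless the QP/EE state being entered is RTS.
        if qp_state != "RTS":
            raise TransitionError("Rearm requires the QP/EE state RTS")
        return PathState.REARM
    if current is PathState.REARM and requested is PathState.ARMED:
        # Assumption: the CI does this once both ends are in Rearm.
        if actor is not Actor.CI or not remote_in_rearm:
            raise TransitionError("Armed requires the CI and both ends in Rearm")
        return PathState.ARMED
    if current is PathState.ARMED and requested is PathState.MIGRATED:
        # Either actor may migrate, but only while the QP/EE state is RTS.
        if qp_state != "RTS":
            raise TransitionError("migration requires the QP/EE state RTS")
        return PathState.MIGRATED
    raise TransitionError(f"{current.value} -> {requested.value} is not allowed")
```

A full cycle in this model is Migrated to Rearm (Verbs Consumer), Rearm
to Armed (CI, both ends in Rearm), then Armed to Migrated (either actor).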
10.4.1.3 Armed

The Armed state indicates that the CIs associated with the connected
QP/EEs on both the local and the remote node are ready to perform a path
migration.

10.5 Multicast Services

Multicast is the ability to send a message to a single address and have it
delivered to multiple queue pairs which may be on multiple endnodes.
There are two types of multicast specified by IBA: IBA unreliable
multicast, and raw packet multicast.

IBA Unreliable Multicast is an optional feature for HCAs. An HCA can be 
queried to determine the number of multicast groups supported by that 
HCA. The number of multicast groups is set to zero if the HCA does not 
support IBA unreliable multicast. 

O10-33: If the CI supports IBA Unreliable Multicast, it must support at 
least one multicast group. 

Raw packet multicast is an optional feature for HCAs. An HCA can be 
queried to determine whether it supports raw packet multicast. 



10.5.1 Multicast Groups and Multicast Message Reception 



A multicast group is a collection of endnodes which receive multicast mes- 
sages sent to a single address. Multicast groups are a fabric management 
responsibility and are targeted through the use of an address. 



10.5.1.1 IBA Unreliable Multicast Reception 



O10-34: If the CI supports IBA Unreliable Multicast, a UD QP must be at- 
tached to a multicast group in order to receive IBA Multicast messages. 

A QP is attached to or detached from a multicast group through the Verbs. 
The only function of the Attach QP to Multicast Verb is to assign a receive 
QP to the multicast group. If the HCA does not have the ability to allow the 
QP to attach to the multicast group, it shall return an immediate error indi- 
cating that there are insufficient resources. 

One or more QPs, up to the maximum supported by the HCA, can be at- 
tached to each multicast group. In order to receive packets sent to the 
Multicast group, every QP attached to a particular multicast group should 
be a member of the same partition as the partition of the incoming packet. 

Only Unreliable Datagram QPs can be used for IBA unreliable multicast. 
Therefore, all Unreliable Datagram semantics also apply to IBA unreliable 
multicast. 
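
The attachment and reception rules above can be illustrated with a
short model. This is a minimal sketch, assuming a per-group attachment
limit stands in for "insufficient resources" and that each QP belongs to
a single partition; the names QP, MulticastGroup, InsufficientResources
and receivers are hypothetical and are not Verbs defined by this
specification.

```python
from dataclasses import dataclass


class InsufficientResources(Exception):
    """Hypothetical stand-in for the immediate error returned by the Verb."""


@dataclass(frozen=True)
class QP:
    qpn: int
    service_type: str  # "UD" for Unreliable Datagram
    partition: int     # simplification: one partition per QP


class MulticastGroup:
    def __init__(self, max_attached_qps):
        self.max_attached_qps = max_attached_qps
        self.attached = []

    def attach(self, qp):
        # Only Unreliable Datagram QPs can be used for IBA unreliable multicast.
        if qp.service_type != "UD":
            raise ValueError("only Unreliable Datagram QPs can be attached")
        # The HCA reports insufficient resources when it cannot attach the QP.
        if len(self.attached) >= self.max_attached_qps:
            raise InsufficientResources("cannot attach another QP")
        self.attached.append(qp)

    def detach(self, qp):
        self.attached.remove(qp)

    def receivers(self, packet_partition):
        # A QP receives the packet only if it is attached and is a member
        # of the same partition as the incoming packet.
        return [qp.qpn for qp in self.attached if qp.partition == packet_partition]
```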
10.5.1.2 Raw Packet Multicast Reception

Raw packet QPs are not attached to multicast groups in order to receive
raw packet multicast messages. If an HCA supports only one raw IPv6 QP
per port, all raw IPv6 multicast messages received on a port are delivered
to that port's raw IPv6 QP; if multiple raw IPv6 QPs are supported, raw
IPv6 multicast messages are delivered to a subset of those QPs based on
an implementation-defined policy which is outside the scope of IBA.
Similarly, if an HCA supports only one raw ethertype QP per port, all raw
ethertype multicast messages received on a port are delivered to that
port's raw ethertype QP; otherwise, the distribution of those messages is
again implementation-defined.

Only raw packet QPs can be used for raw packet multicast. Therefore, all
raw semantics also apply to raw packet multicast.

10.5.2 Multicast Work Requests

10.5.2.1 IBA Unreliable Multicast Work Requests

IBA unreliable multicast Work Requests must be submitted through the
Post Send Request Verb to a single destination address. This destination
address is specified with an Address Handle as part of the Work Request.
Any Unreliable Datagram QP can be used to initiate an IBA unreliable
multicast Work Request. A QP is not required to be attached to a Multicast
Group in order to initiate an IBA Unreliable Multicast Work Request.

Send is the only operation allowed on an Unreliable Datagram Send Work
Queue. Atomic and RDMA operations are not allowed. Unreliable
Datagram messages must be no larger than the path MTU between the
requester and the responder. Therefore, these restrictions apply to IBA
unreliable multicast.

10.5.2.2 Raw Packet Multicast Work Requests

Raw packet Multicast Work Requests must be submitted through the Post
Send Request Verb to a single destination address. This destination
address is specified as a modifier to the Post Send Request Verb. Any raw
packet QP can be used to initiate a raw packet multicast Work Request.

Send is the only operation allowed on a raw packet Send Work Queue.
Atomic and RDMA operations are not allowed. Raw packet messages
must be no larger than the path MTU between the requester and the
responder. Therefore, these restrictions apply to raw packet multicast.

10.5.3 Multicast Destination Establishment

A multicast group is defined by a destination address. Multicast
destination addresses have the same set of attributes as a unicast address.

O10-35: If the CI supports IBA Unreliable Multicast, then the CI shall drop
all IBA Unreliable Multicast packets if the destination QP number is not
0xFFFFFF.

The special multicast QP number does not have to be the QP number
used by the destination to receive a multicast.

C10-43: The method for preparing a multicast group address as a
destination shall be the same as any other address specified in a Work
Request on an Unreliable Datagram or Raw Packet Service Type.

Creating & Destroying multicast groups are fabric management issues.
Permitting nodes to join and leave a multicast group is a fabric
management issue. The MTU of a multicast group is the MTU specified when
the multicast group is created and is a parameter in the multicast MAD.

o10-36: If the CI supports IBA Unreliable Multicast, then Multicast
loopback, which is sending an IBA unreliable multicast message to a
multicast group to which QPs within the sending node are attached, must
be supported by the CI.

As with multicast reception, loopback for raw packet multicast depends on
the number of raw packet queue pairs per port which an HCA supports. If
an HCA supports only one raw queue pair of each type per port, then no
loopback is performed; multicast messages sent on such a QP are not
received on that same QP. If an HCA supports multiple raw QPs of each
type per port, then multicast messages sent on one may or may not be
received on another; the details of this are an implementation-defined
policy which is outside the scope of IBA.

10.6 Memory Management

10.6.1 Overview

The InfiniBand™ Architecture provides sophisticated high performance
operations like remote DMA and user mode I/O. To achieve this goal, the
InfiniBand™ Architecture has to specify appropriate memory management
mechanisms. The overriding goals are performance, robustness
and simplicity.



10.6.2 Memory Registration 



An HCA, like a typical I/O bus host bridge, accesses Host System memory
using what this specification refers to as physical memory addresses¹.
Physical address space for Host System memory is typically organized
into pages of fixed or varying sizes, and a given logical data buffer that
spans page boundaries usually has a non-contiguous physical address
range.

1. On some Host Systems, such "physical addresses" are actually mapped by
the Host System memory controller to provide features such as memory
interleaving or memory sparing, but this specification still refers to them as
physical addresses.

Memory Registration provides mechanisms that allow Consumers to
describe a set of virtually contiguous memory locations or a set of
physically contiguous memory locations to the Channel Interface in order
to allow the HCA to access them as a virtually contiguous buffer using
Virtual Addresses.

All Consumers must explicitly register the memory locations containing
data buffers before the HCA can access them.

C10-44: If a CI processes a WR or incoming RDMA or Atomic request that
attempts to access memory locations that have not been registered, the
CI must not perform the access, and the CI must return an appropriate
error.

Registration may fail due to unavailability of the necessary Channel
Interface resources. No memory is registered in this case.

C10-45: Registration must either fully succeed or fail in an atomic
fashion.

10.6.2.1 Memory Regions


A set of memory locations that have been registered is referred to as a
Memory Region.

The products of a memory registration operation are:

• MemoryRegionHandle

The Memory Registration Verbs produce a MemoryRegionHandle
that is used to identify a specific Memory Region to the Memory
Management Verbs.

• L_Key

The Memory Registration Verbs produce an L_Key. The L_Key,
along with a Virtual Address that is within the bounds of the region,
is used in a Work Request's data segment to identify a memory
location within a specific Memory Region to the CI.

• R_Key

The Memory Registration Verbs produce, when requested, an
R_Key. The R_Key, along with a Virtual Address that is within the
bounds of the region, is used in RDMA and Atomic operations to
identify a memory location within a specific Memory Region to the
CI.

• Virtual Address

The Memory Registration Verbs are supplied (or in some cases
produce) a Virtual Address that corresponds to the first memory
location in the set of memory locations supplied to the Memory
Registration Verbs.

When registering a Memory Region, the Consumer specifies whether
Memory Windows are enabled to be bound to the Memory Region or not.
Memory Windows are described in 10.6.6.2.2 Remote Access Through
Memory Windows.



10.6.3 Access to Registered Memory

C10-46: The CI shall support the following access rights: Local Read,
Local Write, Remote Read, and Remote Write.

O10-37: If the CI supports Atomic operations, the CI shall support the
Remote Atomic access right.

10.6.3.1 Local Access to Registered Memory

A Memory Region is always accessible by the local HCA it was registered
with (i.e., an HCA in the same Host System as the Consumer); the type of
access allowed depends on the Access Rights assigned to that Memory
Region.

C10-47: The CI shall automatically include Local Read in every Memory
Region's Access Rights.

The Consumer may request that Local Write be assigned to a Memory
Region's Access Rights. If desired, policies related to preventing the
assignment of Local Write to a Memory Region can be implemented by the
Consumer.

10.6.3.2 Remote Access to Registered Memory

The Consumer may, in addition to the Local access rights, assign Remote
access rights to a Memory Region. Remote access rights are Remote
Read, Remote Write and Remote Atomic. Remote access rights are
individually selectable and when selected, allow one or more specific
operation types to access the Memory Region. The Consumer is not allowed
to assign Remote Write or Remote Atomic to a Memory Region that has not
been assigned Local Write.

C10-48: If a Memory Registration specifies Remote Write or Remote
Atomic without specifying Local Write, the CI must return an immediate
error.
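
The access right rules in C10-47 and C10-48 can be expressed as a small
check performed at registration time. This is a minimal sketch: the
Access flag type, the ImmediateError exception and the effective_access
function are hypothetical names chosen for illustration and are not part
of the Verbs.

```python
from enum import Flag, auto


class Access(Flag):
    LOCAL_READ = auto()
    LOCAL_WRITE = auto()
    REMOTE_READ = auto()
    REMOTE_WRITE = auto()
    REMOTE_ATOMIC = auto()


class ImmediateError(Exception):
    """Hypothetical stand-in for an immediate error returned by a Verb."""


def effective_access(requested):
    """Validate requested rights and return the rights actually assigned."""
    needs_local_write = Access.REMOTE_WRITE | Access.REMOTE_ATOMIC
    # C10-48: Remote Write or Remote Atomic without Local Write is refused.
    if requested & needs_local_write and not requested & Access.LOCAL_WRITE:
        raise ImmediateError("Remote Write/Atomic requires Local Write")
    # C10-47: Local Read is always part of a Memory Region's Access Rights.
    return requested | Access.LOCAL_READ
```

For example, requesting only Remote Read yields Remote Read plus Local
Read, while requesting Remote Write without Local Write is rejected.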

10.6.3.3 Local Access Keys

When a set of memory locations is registered, an object called an L_Key
that is associated with that Memory Region is returned to the Consumer.
Work Requests may require the Consumer to supply a locally accessible
data buffer. Locally accessible data buffers are described by a Virtual
Address that points to a location within a Memory Region, the L_Key
associated with that Memory Region and the quantity of bytes in the buffer
that may be used by the Work Request.

Memory Regions are described to the CI for local access by a combination
of a Virtual Address within that Memory Region and the L_Key that
was returned to the Consumer when the region was registered.

10.6.3.4 Remote Access Keys

When a memory region is registered with Remote Access Rights, an
additional object called an R_Key that is associated with that Memory
Region is returned to the Consumer. Work Requests that will initiate an
RDMA operation require the Consumer to supply a remotely accessible
data buffer. Remotely accessible data buffers are described by a Virtual
Address and an R_Key that have been supplied by the target endpoint. A
Memory Region targeted by a remote operation must have appropriate
Remote Access Rights for the type of operation.

Memory Regions are described to a CI for remote access by a combination
of a Virtual Address within that Memory Region and the R_Key that
was returned to the Consumer when the region was registered.

10.6.3.5 Protection Domains

A Protection Domain (PD) associates Memory Regions and Queue Pairs.
Protection Domains are specific to each HCA. Each Memory Region must
be associated with a single Protection Domain. Multiple Memory Regions
may be associated with the same PD. Each Queue Pair in an HCA must
be associated with a single Protection Domain. Multiple Queue Pairs may
be associated with the same PD. Access to Memory Regions described
in Work Requests and Remote Operation requests is allowed only when
the Protection Domain of the Memory Region and of the Queue Pair that
is processing the request are identical. The setting of protection domains
is expected to be controlled by a Privileged Consumer.

10.6.3.6 Scope of Access

Memory is registered for use on a specific HCA. L_Keys and R_Keys are
specific to an HCA and do not grant access to the Memory Region by
other local HCAs. The CI is not required to enforce that L_Keys or R_Keys
associated with one HCA will always result in an error if used with a
different HCA.
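
The local access rules of 10.6.3.3 and 10.6.3.5 can be combined into a
single check, sketched below. This is an illustration under stated
assumptions rather than a description of any CI: keys and Protection
Domains are modelled as plain integers, registered regions are kept in a
dictionary indexed by L_Key, and the names MemoryRegion, AccessError and
check_local_access are hypothetical.

```python
from dataclasses import dataclass


class AccessError(Exception):
    """Hypothetical error raised when a local access must not be performed."""


@dataclass(frozen=True)
class MemoryRegion:
    l_key: int
    start_va: int      # Virtual Address of the first registered byte
    length: int        # size of the region in bytes
    pd: int            # Protection Domain the region is associated with
    local_write: bool = False


def check_local_access(regions, qp_pd, l_key, va, nbytes, write=False):
    """Return the region if the access is allowed, else raise AccessError.

    regions maps L_Key values to MemoryRegion objects.
    """
    region = regions.get(l_key)
    if region is None:
        raise AccessError("memory has not been registered")
    # The PD of the Memory Region and of the Queue Pair must be identical.
    if region.pd != qp_pd:
        raise AccessError("Protection Domain mismatch")
    # The buffer must lie entirely within the bounds of the region.
    if nbytes < 0 or va < region.start_va or va + nbytes > region.start_va + region.length:
        raise AccessError("buffer is outside the Memory Region")
    # Local Read is always present; Local Write must have been assigned.
    if write and not region.local_write:
        raise AccessError("Local Write was not assigned")
    return region
```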

10.6.3.7 Multiple Registration of Memory regions

The same set of memory locations may be registered multiple times,
resulting in multiple MemoryRegionHandles, L_Keys and R_Keys. Each
Registration is considered a separate and distinct Memory Region and
may be independently associated with a Protection Domain.

For cases where it's desired to have multiple registrations of a specific
set of memory locations, provision for optimizing the use of Channel
Interface resources is provided. See Section 11.2.7.7 Register Shared
Memory Region on page 490.

Some processor architectures support global virtual address spaces of 80
bits or more. However, the virtual addresses ("pointers") most applications
can readily manipulate and supply as parameters are typically either 32
bits or 64 bits, and actually serve as offsets into the handful of processor
memory "segments" associated with the process. Thus, the virtual address
parameters passed in by Consumers at the Verbs layer must each be
interpreted in the proper context of their associated process. The L_Key
or R_Key that accompanies each virtual address parameter helps the CI
identify the appropriate context.

The virtual addresses ("pointers") that Consumers manipulate and pass
as parameters are referred to simply as Virtual Addresses in this
specification. The size of the Virtual Addresses used to specify a memory
region to be registered and for local memory locations in Work Requests
is implementation dependent. The size of Virtual Addresses used to
specify remote memory locations in Work Requests is 64 bits.

10.6.4.1 Virtual to physical translations

Figure 117 Registered Virtual Buffer to Physical Page Relationship

10.6.4.2 Registration of virtually addressed regions



A virtually contiguous set of memory locations is specified by a Virtual
Address that points to the first byte of the set and the length of the set in
bytes. Figure 117 illustrates a virtually contiguous set of memory locations
backed by three physical pages. The size of the pages that back the
region depends on the Host System hardware and Host Operating System.

C10-50: The CI shall support arbitrary byte alignment for the virtually
contiguous buffer being registered.

C10-51: The CI shall support arbitrary length for the virtually contiguous
buffer being registered, up to the limit specified by the HCA attribute.

The address translation and access rights of the region apply to each
complete page within that memory region. The CI is not required to
enforce access rights for local accesses with byte-level granularity.

The pages in the illustration are 4096 bytes each. The actual page size
depends on the host hardware and host operating system.

In the example above, access to the memory locations at Virtual
Addresses 0x141000 through 0x1411FF may be allowed even though they
precede the first address of the region requested to be registered.
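
The page-level effect described above can be sketched in a few lines of
Python. This is only an illustration, not part of the specification: the
function name is hypothetical, the region start of 0x141200 is inferred
from the address range quoted above, and the region length of 0x2000
bytes is an assumption chosen so that the buffer spans three 4096-byte
pages.

```python
PAGE_SIZE = 4096  # page size used in the illustration; real systems may differ


def covered_page_range(start, length, page_size=PAGE_SIZE):
    """Return (first, last) byte addresses of the whole pages backing a buffer."""
    if length <= 0:
        raise ValueError("length must be positive")
    # Round the first byte down to the start of its page.
    first = start - (start % page_size)
    # Round the last byte up to the end of its page.
    last_byte = start + length - 1
    last = last_byte - (last_byte % page_size) + page_size - 1
    return first, last


# Assumed region: starts at 0x141200 and is 0x2000 bytes long.
first, last = covered_page_range(0x141200, 0x2000)
print(hex(first), hex(last))
```

With these assumed values the covered range begins at 0x141000, so the
bytes from 0x141000 through 0x1411FF fall inside a registered page even
though they precede the requested buffer.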

10.6.4.2.1 Registered memory residency

C10-52: Using the Verbs defined in this specification, when a Memory
Region is registered, every page within the region must be pinned down in
physical memory.

This guarantees to the HCA that the Memory Region is physically resident
(not paged out) and that the virtual to physical translation remains fixed
while the region is registered.

The Verbs that register Virtually Addressed Regions are responsible for
requesting that the OS pin the associated pages and for requesting from
the OS any required per-page Virtual to Physical translation information.
The Channel Interface is not required to track pages common to multiple
registrations. The Channel Interface must be able to assume that the OS
service that accepts requests for pinning and unpinning virtually
addressed pages will maintain the appropriate reference counts on those
pages such that pinned pages are not actually unpinned until the number
of unpin requests equals the number of pin requests for any specific page.

The Verbs that register Virtually Addressed Regions are expected to
request that the OS pin the pages associated with the region every time a
region is registered, regardless of any association with previously
registered regions. The Channel Interface is not prohibited from
implementing optimizations that reduce the number of OS service requests
it makes for pinning and unpinning memory.
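
The reference counting that the Channel Interface is allowed to assume
of the OS can be pictured with the following minimal sketch. The class
and method names are hypothetical and do not correspond to any real OS
service; pages are identified here by plain integers.

```python
from collections import Counter


class PinService:
    """Hypothetical stand-in for an OS page pinning service."""

    def __init__(self):
        self._counts = Counter()  # page number -> outstanding pin requests

    def pin(self, pages):
        for page in pages:
            self._counts[page] += 1

    def unpin(self, pages):
        for page in pages:
            if self._counts[page] == 0:
                raise ValueError(f"page {page} is not pinned")
            self._counts[page] -= 1
            if self._counts[page] == 0:
                del self._counts[page]  # last unpin: page may be paged out again

    def is_pinned(self, page):
        return self._counts[page] > 0


# Two registrations overlap on page 2.
service = PinService()
service.pin([1, 2])      # first registration
service.pin([2, 3])      # second registration
service.unpin([1, 2])    # first region deregistered
print(service.is_pinned(1), service.is_pinned(2), service.is_pinned(3))
```

Page 2 stays pinned after the first deregistration because the second
registration still holds a pin request on it.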

10.6.4.3 Registration of physically addressed regions

As an alternative to specifying a Region by a contiguous range in the
Consumer's virtual address space mapped by the processor, Privileged
Consumers can specify a Region by a list of physically addressed buffers,
which correspond to pages mappable by the HCA. Besides the list of
physical buffers, the Consumer supplies a requested "I/O Virtual Address"
to be associated with the first byte of the Region, which is allowed to begin
anywhere within the first physical buffer. The Consumer also supplies a
byte offset that specifies where the Region begins within the first physical
buffer. The Channel Interface returns the I/O Virtual Address that is
actually assigned for the Region. The Channel Interface is not required to
assign the I/O Virtual Address requested by the Consumer, but is
encouraged to do so wherever possible.

The Consumer also supplies the length of the Region in bytes. The last
byte of the Region, as specified by the Region length, must fall within the
last physical buffer, but is allowed to fall anywhere within the last physical
buffer.

The Virtual Address in this context is called an "I/O Virtual Address" since 
it isn't necessarily mapped in the processor's virtual address space, and 
might be used solely for local or remote accesses performed by the HCA. 

The maximum size of an I/O Virtual Address is 64 bits.



10.6.4.3.1 Physical Buffer lists 



Physical buffer lists used for registration consist of one or more physically
contiguous memory regions that must start and end on a CI-supported
page boundary.

C10-53: If the physical buffer list in a physical memory registration
contains an element that does not start and end on a CI-supported page
boundary, the CI shall return an error.
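
A check along the lines of C10-53 might look like the sketch below. It
assumes, purely for illustration, that each list element is a
(start address, length) pair and that a single page size of 4096 bytes
is supported; the function name and the use of an exception to stand in
for "return an error" are likewise assumptions.

```python
def validate_physical_buffer_list(buffers, page_size=4096):
    """Raise ValueError unless every buffer starts and ends on a page boundary.

    buffers: list of (start_address, length_in_bytes) tuples (assumed layout).
    """
    if not buffers:
        raise ValueError("physical buffer list must contain at least one buffer")
    for start, length in buffers:
        end = start + length
        if length <= 0 or start % page_size or end % page_size:
            raise ValueError(
                f"buffer at {start:#x} (length {length:#x}) is not page aligned"
            )


# Two page-aligned buffers are accepted.
validate_physical_buffer_list([(0x10000, 0x1000), (0x40000, 0x2000)])
```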

All of the physical buffers in a physical buffer list must remain accessible 
by the CI until after the region has been deregistered. 

For the case where the physical buffers in the physical buffer list are ac- 
tually the pinned pages of a virtually addressed buffer, the Consumer is 
expected to keep those pages pinned while the region is registered. 

It is the responsibility of the Consumer to determine if and when, after
deregistration, the pages should be unpinned. It is the responsibility of the
Consumer to ensure proper operation in cases where the pages in the
physically addressed region are also in use in a virtually addressed region
that has been registered.

10.6.4.4 Memory Region Error Checking 

It is an error for a Consumer to use Virtual Addresses that are outside of
the registered locations in a Memory Region. 

10.6.4.4.1 Error Checking of Local Accesses to Memory Regions 

C10-54: The CI is required to ensure that the memory locations being
referenced using a Virtual Address and L_Key are within a page of a Memory
Region with the same PD as the QP that is processing the WR.

The CI is allowed to support finer-level granularity of local access control.

C10-55: The CI is required to ensure that the Local Access Rights of that
Memory Region allow the type of access requested.

It is strongly encouraged that the Channel Interface check and ensure that
the Virtual Address is within the Memory Region to which the L_Key is
associated and report any bounds violation at access time. It is not
mandatory that the Channel Interface enforce such checking.


10.6.4.4.2 Error Checking of Remote Accesses to Memory Regions

C10-56: The CI is required to ensure that the memory locations being
referenced using a Virtual Address and R_Key are within a Memory Region
with the same PD as the QP that is processing the Remote Operation. The
CI shall enforce this with a granularity not to exceed 4096 bytes.

C10-57: The CI is required to ensure that the Remote Access Rights of
that Memory Region allow the type of access requested.

It is strongly encouraged that the Channel Interface check and ensure that
the Virtual Address is within the Memory Region to which the R_Key is
associated and report any bounds violation at access time. It is not
mandatory that the Channel Interface enforce such checking.
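
The remote access checks above can be summarized in a small sketch. All
names are hypothetical, access rights are modeled as strings, and the
4096-byte granularity is applied by rounding the region bounds outward
to the nearest multiple of 4096; a `strict_bounds` flag models the
encouraged, but not mandatory, byte-level bounds check.

```python
from dataclasses import dataclass

GRANULARITY = 4096  # coarsest enforcement granularity allowed for remote access


@dataclass(frozen=True)
class MemoryRegion:
    pd: int                   # Protection Domain identifier
    start: int                # first registered byte
    length: int               # registered length in bytes
    remote_rights: frozenset  # subset of {"read", "write", "atomic"}


def check_remote_access(region, qp_pd, addr, length, access, strict_bounds=False):
    """Return True if the access passes PD, rights, and bounds checks."""
    if region.pd != qp_pd:
        return False                      # PD of region and QP must match
    if access not in region.remote_rights:
        return False                      # requested access type not granted
    lo, hi = region.start, region.start + region.length
    if not strict_bounds:
        lo -= lo % GRANULARITY            # round start down
        hi += -hi % GRANULARITY           # round end up
    return lo <= addr and addr + length <= hi


region = MemoryRegion(pd=1, start=0x141200, length=0x2000,
                      remote_rights=frozenset({"read"}))
print(check_remote_access(region, 1, 0x141000, 16, "read"))
print(check_remote_access(region, 1, 0x141000, 16, "read", strict_bounds=True))
```

The first call is accepted under 4096-byte granularity, while the second
is rejected once byte-level bounds are enforced.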

10.6.5 DEREGISTRATION OF REGIONS

When access to a Memory Region by a CI is no longer required, the
Consumer may reverse the registration process for that region. The process
of deregistering a Memory Region will revoke all HCA access rights to that
Memory Region.

Memory locations that have been registered multiple times will be
represented by multiple Memory Regions. The deregistration of a single
Memory Region prevents HCA access to those memory locations via the L_Key
(and R_Key if any) associated with that Memory Region. Access to the
memory locations via L_Keys and R_Keys associated with other Memory
Regions is not affected.
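
A toy model of independent deregistration is given below. The table,
its methods, and the use of small integers as keys are all hypothetical;
the sketch only shows that removing one registration leaves another
registration of the same memory usable.

```python
import itertools


class RegionTable:
    """Hypothetical table mapping keys to registered (start, length) ranges."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._regions = {}

    def register(self, start, length):
        key = next(self._next_key)      # every registration gets its own key
        self._regions[key] = (start, length)
        return key

    def deregister(self, key):
        del self._regions[key]          # revokes access through this key only

    def can_access(self, key, addr):
        region = self._regions.get(key)
        if region is None:
            return False                # deregistered or unknown key
        start, length = region
        return start <= addr < start + length


table = RegionTable()
key_a = table.register(0x1000, 0x100)
key_b = table.register(0x1000, 0x100)   # same memory, second registration
table.deregister(key_a)
print(table.can_access(key_a, 0x1000), table.can_access(key_b, 0x1000))
```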

C10-58: The CI shall support independent deregistration of partially or
completely overlapping Registered Memory Regions.

C10-59: Work Requests or Remote Operation requests that are in process
and actively referencing memory locations in a Memory Region that
is deregistered must fail with a protection violation.

C10-60: Work Requests or Remote Operation requests that attempt to
access memory locations in a Memory Region that has been deregistered
must fail with a protection violation.

The Verbs that cause a Memory Region to be deregistered are expected
to request that the OS unpin the pages associated with the region if a
request to pin those pages was performed when the region was registered,
regardless of any association with previously registered regions. The
Channel Interface is not prohibited from implementing optimizations that
reduce the number of OS service requests it makes for pinning and
unpinning memory.



10.6.6 Memory Access Control

The immediate Consumer of every memory registration related Verb is
privileged code in the OS. In general, the OS is responsible for
determining and enforcing access control policy for memory registrations it
does on behalf of User-level Consumers. For instance, it is anticipated but
not required that OSs will enforce policies similar to the following:

• A User-level Consumer has control over which of its memory areas
can be accessed by HCA data transfer operations.

• A User-level Consumer can enable any local memory area it has
access to for access by HCA data transfer operations.

• A User-level Consumer cannot enable HCA read access to memory
areas that the Consumer itself doesn't have read access to.

• A User-level Consumer cannot enable HCA write access to memory
areas that the Consumer itself doesn't have write access to.

When a Consumer creates QPs or CQs (through the appropriate Verbs),
the HCA driver automatically allocates and pins any local memory needed
for the associated control structures. Access by the HCA to these control
structures is implicitly enabled. Access by the Consumer to these control
structures is supported only indirectly through Verbs, and any Region
Handles or L_Keys (if they exist) for the control structures are not exposed
to the Consumer.

A Consumer controls which QPs can access which Memory Regions and
which Memory Windows through the use of Protection Domains (PDs).
Prior to creating any QPs, registering any Memory Regions, or allocating
any Memory Windows, the Consumer will allocate one or more PDs.
Then, when creating QPs, registering Memory Regions, or allocating
Memory Windows, the Consumer specifies which PD each is associated
with. QPs can only access Memory Regions or Memory Windows that are
in the same PD.

10.6.6.1 Local Access Control

With Sends and Receives, the Consumer explicitly specifies the buffers
that are accessed through the local Data Segments it passes in the
associated Work Requests. Each local Data Segment contains an address, its
associated L_Key, and a length parameter. Multiple local Data Segments
can be supplied for each send or receive where scatter/gather operation
is desired.

Local Data Segments are also used for RDMA Write gather lists, RDMA
Read scatter lists, and AtomicOp return values. Again each local Data
Segment contains an L_Key which governs local access to the
corresponding local Memory Region. However, the remote Data Segment
associated with an RDMA Write, RDMA Read, or AtomicOp will contain an
R_Key instead of an L_Key. This is discussed further below.

Two types of local access, read and write, are associated with Memory
Regions. Send buffers and RDMA Write gather buffers require local read
access. Receive buffers, RDMA Read scatter buffers, and AtomicOp
return buffers require local write access.
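
The pairing of buffer roles with local access types stated above can be
written as a lookup table. The role names used as dictionary keys are
hypothetical labels chosen for this sketch.

```python
# Local access right each kind of local buffer needs, per the text above.
REQUIRED_LOCAL_ACCESS = {
    "send": "read",
    "rdma_write_gather": "read",
    "receive": "write",
    "rdma_read_scatter": "write",
    "atomic_return": "write",
}


def local_access_ok(buffer_role, region_rights):
    """Return True if the region's local rights cover the buffer's role."""
    return REQUIRED_LOCAL_ACCESS[buffer_role] in region_rights


# A region registered with local read access only can source a Send
# but cannot serve as a Receive buffer.
print(local_access_ok("send", {"read"}), local_access_ok("receive", {"read"}))
```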

Though memory registration is required to enforce local access only to
page-level granularity, the local Data Segments used by Sends, Receives,
RDMA Writes, RDMA Reads, and AtomicOps specify byte starting
addresses and byte-count lengths. Thus the Consumer still has byte-level
granularity of access control for local buffers accessed by these locally
initiated operations. The Consumer can determine the actual range of
access control enforced using the Query Memory Region Verb.



10.6.6.2 Remote Access Control 



When a Consumer wants to allow remote agents to access its local
memory using RDMA Writes, RDMA Reads, or AtomicOps, the Consumer
must explicitly enable remote access and pass an appropriate R_Key to
the remote agent for it to use when initiating these operations that target
the Consumer's (local) memory.



10.6.6.2.1 Remote Access Directly With Memory Regions 



A Consumer can use either of two mechanisms to enable remote access
to its memory. The first mechanism involves enabling remote access
when a Memory Region is registered. The second mechanism involves
first allocating and then binding a Memory Window to an existing Memory
Region. Either mechanism results in an R_Key with associated remote
access rights for a specified memory area.

Three types of remote access (read, write, and atomic) are supported.
RDMA Write requires write access at the remote target, RDMA Read
requires read access at the remote target, and AtomicOps require atomic
access at the remote target. While perhaps not obvious, it may make
sense for a Consumer to allow atomic access but not allow write access,
since AtomicOps are not required by the architecture to be atomic with
respect to RDMA writes.

When registering a Memory Region, a Privileged Consumer can generally
specify any combination of remote access rights for the Region, including
all or none. However, if a registration request does not specify local write
access to the region, the CI will return an error if remote write or remote
atomic access is specified.

If any remote access rights are specified, the Verb will return an R_Key.
This R_Key grants the specified remote access rights for the entire
Memory Region as bounded by the byte starting address and byte length,
but the granularity of the access control actually enforced by the Channel
Interface is allowed to be up to 4096 bytes. The Consumer can determine
the actual range of access control enforced using the Query Memory
Region Verb. It is strongly encouraged that the Channel Interface enforce
access control with byte-level granularity.
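
The rule tying remote write and remote atomic access to local write
access, together with the condition under which an R_Key is returned,
can be sketched as follows. The function name, the string labels for
the access rights, and the use of exceptions to represent the returned
error are assumptions made for this illustration.

```python
REMOTE_RIGHTS = {"read", "write", "atomic"}


def validate_registration_rights(local_write, remote_rights):
    """Check requested rights; return True if an R_Key would be returned."""
    remote_rights = set(remote_rights)
    unknown = remote_rights - REMOTE_RIGHTS
    if unknown:
        raise ValueError(f"unknown remote access rights: {sorted(unknown)}")
    # Remote write and remote atomic both modify the region, so they
    # are rejected unless local write access was also requested.
    if not local_write and remote_rights & {"write", "atomic"}:
        raise PermissionError(
            "remote write/atomic access requires local write access"
        )
    return bool(remote_rights)  # an R_Key is returned only if any right is set


print(validate_registration_rights(local_write=False, remote_rights={"read"}))
print(validate_registration_rights(local_write=True, remote_rights=set()))
```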






10.6.6.2.2 Remote Access Through Memory Windows 

When a Consumer needs more flexible control over remote access to its
memory, the Consumer can use Memory Windows. Memory Windows are
intended for situations where the Consumer:

• wants to grant and revoke remote access rights to a registered
Region in a dynamic fashion with less of a performance penalty than
using deregistration/registration or reregistration.

• wants to grant different remote access rights to different remote
agents and/or grant those rights over different ranges within a
registered Region.

To use a Memory Window, the Consumer allocates one and then binds it
to a specified address range of an existing Memory Region that is enabled
for use with Memory Windows. The range can include the entire Memory
Region or any virtually contiguous subset of it. A Memory Window can
only be bound to a Memory Region that belongs to the same Protection
Domain.

C10-61: The CI shall enforce remote access control for Memory Windows
with byte-level granularity.

When binding a Memory Window, a Consumer can request any combination
of remote access rights for the Window. However, if the associated
Region does not have local write access enabled and the Consumer
requests remote write or remote atomic access for the Window, the Channel
Interface must return an error either at bind time or access time. See
10.6.6.2.5 Error Checking at Window Bind Time and 10.6.6.2.6 Error
Checking at Window Access Time.

C10-62: If a Memory Region does not have local write access enabled,
the CI shall return an error if a Memory Window Bind request specifies
remote write or remote atomic access to that Region. The CI shall allow
all other requested access rights for Memory Windows.

A Consumer is allowed and commonly expected to enable remote access
rights when binding a Window that it may not have enabled when it
registered the underlying Region, provided it doesn't violate the above rule
regarding local write access. For example, a Consumer might register a
Region with no remote access rights, and later bind one or more Windows
to that Region that obviously would grant remote access rights.

Allocating or deallocating a Memory Window requires a kernel transition,
and thus incurs the associated software overhead. Binding a Memory
Window is performed with a Work Request posted to a send queue, and
thus incurs far less software overhead with typical implementations.

C10-63: Each time a given Memory Window is bound, the CI shall return
an R_Key whose value is different from the immediate previous value.
After the bind operation completes, any access attempts using the
immediate previous R_Key must fail.

When the Memory Window is bound, the Verb returns the new R_Key
immediately after posting the Work Request, even though the actual binding
operation performed by the HCA hasn't yet occurred.

Implementation Note: an envisioned implementation for an R_Key is
to have it consist of two fields: an index field and a key field. The
index field is used by the HCA to identify the associated Memory
Window resource, and remains constant. The key field is changed
each time the R_Key is bound, which guarantees that the immediate
previous R_Key is invalidated as required. The use of a sufficiently
large key field and a suitable random number with each binding can provide
some amount of protection against the holder of an invalidated R_Key
being able to access the Memory Window without authorization.

The Channel Interface software that prepares the Bind Work Request
generates the new key value and places it in the Work Request for the
HCA to record in its Memory Window resource when processing the
request. This way, the new R_Key value is fully determined and can
be returned to the Consumer prior to the HCA processing the request.

It is not required that Channel Interfaces use this implementation.
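
The two-field R_Key described in the Implementation Note could be
modeled as in the sketch below. The 8-bit key width, the class and
method names, and the choice of Python's `secrets` module as the random
source are all assumptions; the note itself does not fix any field
sizes.

```python
import secrets

KEY_BITS = 8  # assumed width of the key field; not fixed by the note


class MemoryWindow:
    """Hypothetical Memory Window resource holding the current key field."""

    def __init__(self, index):
        self.index = index  # constant; identifies the Window resource
        self.key = None     # changes on every bind

    def bind(self):
        """Pick a key that differs from the previous one; return the R_Key."""
        new_key = secrets.randbelow(1 << KEY_BITS)
        while new_key == self.key:
            new_key = secrets.randbelow(1 << KEY_BITS)
        self.key = new_key
        return (self.index << KEY_BITS) | new_key

    def is_current(self, r_key):
        """Return True only for the R_Key produced by the latest bind."""
        if self.key is None:
            return False
        return r_key == ((self.index << KEY_BITS) | self.key)


window = MemoryWindow(index=5)
old_r_key = window.bind()
new_r_key = window.bind()
print(window.is_current(old_r_key), window.is_current(new_r_key))
```

Because the new key is chosen in software before the request reaches
the hardware, the R_Key can be handed back to the caller as soon as the
bind is prepared, which mirrors the behavior the note describes.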

For correct operation, a Consumer must ensure that no remote agent
attempts to use a new R_Key before its associated binding has been
completed by the HCA. One technique to accomplish this is for the Consumer
to submit the Bind operation to the same Send Queue it uses to send the
message that conveys the new R_Key to the remote agent.

The Bind operation has a unique ordering rule:

C10-64: Any Work Request posted to a Send Queue subsequent to a
Bind Work Request shall not begin execution until the Bind operation
completes.

If the HCA detects an error with the Bind operation, it will put the QP into
an error state. With the technique described earlier, the Bind operation is
guaranteed to complete before the remote agent can possibly receive the
new R_Key.

An envisioned common usage model is for a Memory Window to be
allocated once and then used for multiple bindings. When a previously bound
Memory Window is bound again, the previous R_Key and its associated
bindings are automatically invalidated. Any remote agents needing to use
the new Memory Window bindings must use the new R_Key.

If the Consumer wants to invalidate a Memory Window's bindings without
deallocating the Window or enabling remote access to new areas, the
Consumer can submit a Bind request specifying a length of zero.

C10-65: After a zero-length Memory Window Bind completes, the CI shall
not allow any remote access to be performed to that Memory Window
until a subsequent Bind re-enables remote access.

C10-66: The CI shall support multiple Windows bound to the same
Memory Region, each with independent remote access rights, and their
associated areas shall be allowed to be overlapping or disjoint.

10.6.6.2.3 Rebinding or Deallocating Active Windows

Under normal operation, it is improper for a Consumer to deallocate or
change the binding of a Memory Window while it is being accessed by a
remote agent. However, this can occur if remote agents misbehave, or it
can occur under error recovery circumstances.

C10-67: Any Remote Operation requests that are in process and actively
using a Memory Window when its binding is changed must fail with a
protection violation.

C10-68: Once the Bind operation has been reported to the Consumer as
having completed, the Channel Interface must guarantee that no
additional accesses can be performed under the immediate previous binding.

C10-69: Any Remote Operation requests that are in process and actively
using a Memory Window when it is deallocated must fail with a protection
violation.

C10-70: Once the Deallocate Memory Window Verb completes, the
Channel Interface must guarantee that no additional accesses can be
performed through that Memory Window while it remains deallocated.

10.6.6.2.4 Deregistering Regions with Bound Windows

It is an error for a Consumer to deregister or reregister a Memory Region
while it still has any Memory Windows bound to it. Such Windows are said
to be "orphaned". The Channel Interface must handle this error case as
follows.

The Channel Interface is allowed to detect this error case and return an
error without carrying out the deregister or reregister operation.

C10-71: If the CI allows a Memory Region deregister or reregister
operation to create orphaned Windows, the CI must guarantee that any remote
accesses attempted through the orphaned Windows will undergo the
access checks and enforcement described in 10.6.6.2.6 Error Checking at
Window Access Time.

10.6.6.2.5 Error Checking at Window Bind Time

The following checks must be performed at Memory Window "bind time",
which is either when the Channel Interface is executing the Bind Memory
Window Verb that prepares and queues the associated Work Request, or
when the HCA is processing that Work Request.

C10-72: The Channel Interface must check and enforce that the Memory
Window and QP belong to the same PD.

C10-73: The Channel Interface must check and enforce that Memory
Windows are allowed to be bound to the specified Memory Region.

C10-74: The Channel Interface must check and enforce write permissions
with the specified Memory Region, as described in 10.6.6.2.2 Remote
Access Through Memory Windows.

C10-75: The Channel Interface must perform address bounds checks
and PD checks with regard to the specified Memory Region.

10.6.6.2.6 Error Checking at Window Access Time

When the HCA processes an inbound RDMA or Atomic request that
accesses a Window:

C10-76: The Channel Interface must check and enforce that the Memory
Window and QP belong to the same PD.

C10-77: The Channel Interface must check and enforce the address
bounds and access rights associated with the Window.

C10-78: The Channel Interface must check and enforce the access rights
associated with each accessed page.

C10-79: For any previously undetected error cases where the Consumer
orphaned the Window as described in 10.6.6.2.4 Deregistering Regions
with Bound Windows, the Channel Interface must check and enforce that
any pages accessed are in some Memory Region that belongs to the
same PD as the Window.

The Channel Interface is not required to enforce that such pages are
necessarily in the same Region to which the Window was bound. Again, it is
strongly encouraged that the Channel Interface check and report these
error cases at bind or deallocation time instead of access time.

10.7 Work Requests

Work Requests are used to submit units of work to the channel interface.
Several types of Work Requests are supported, and they are abstracted
throughout the Verbs.

How a work request targets its destination depends on the work request
and QP type. The target memory location is contained in the work
request's remote node address information (in the case of RDMA and
Atomics) or in the remote receive QP WR's scatter/gather list (in the case
of Send/Receive). The target QP depends on the QP type. Connected
QPs have the destination QP contained in the local QP context. Datagram
QPs have the destination QP contained as part of the work request. Raw
QPs don't target a specific QP at the destination.



10.7.1 Creating Work Requests 



Work Requests are the only mechanism available to Consumers to gen- 
erate work on work queues. Work requests are used only to pass the op- 
eration from the Consumer to the CI. 

Work Requests are created by the Consumer above the Channel Inter- 
face using mechanisms provided by the OSV. 



10.7.2 Work Request Types 



There are five types of operations which may be posted to the Work 
Queues by the Consumer: 

• Send/Receive

• RDMA Write

• RDMA Read

• Atomic Operations

• Bind Memory Window

Atomic Operations, Sends, RDMA operations and the Bind Memory 
Window operation may all take place on the same QP. 



10.7.2.1 Send/Receive

C10-80: The CI shall support Send and Receive Operations on all
Transport Service Types supported on the CI.

Sends must be posted to the Send Queue.

Receives must be posted to the Receive Queue.

C10-81: The responder's Receive QP shall consume a Work Request on
reception of an incoming send message.

C10-82: The CI shall provide segmentation and reassembly for RC and
UC Transport Service Types.

O10-38: If the CI supports RD Service, the CI shall provide segmentation
and reassembly for RD.

10.7.2.2 RDMA

There are two types of RDMA: RDMA Read and RDMA Write.

RDMA Read Operations are supported only on the two reliable Transport 
Service Types — Reliable Connection and Reliable Datagram. RDMA 
Write Operations are supported on the two reliable Service Types plus the 
Unreliable Connection Service Type. 

C10-83: The CI shall support RDMA Read Operations on the RC
Transport Service Type.

C10-84: The CI shall support RDMA Write Operations on the RC and UC
Transport Service Types.

O10-39: If the CI supports RD Service, the CI shall support both RDMA
Read and Write Operations on the RD Transport Service Type.

RDMA Read and RDMA Write requests are submitted to the Send Queue.

C10-85: The responder's Receive Queue shall not consume a Work
Request for an incoming RDMA Read.

C10-86: The responder's Receive Queue shall consume a Work Request
when Immediate Data is specified in a successfully completed incoming
RDMA Write.

C10-87: The responder's Receive Queue shall not consume a Work
Request when Immediate Data is not specified in an incoming RDMA Write
or the incoming RDMA Write was not successfully completed.
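
The rules for when the responder consumes a Receive Work Request can be
condensed into one predicate. The function name and the string labels
for the incoming operations are hypothetical.

```python
def consumes_receive_wr(operation, has_immediate=False, completed_ok=True):
    """Return True if the responder's Receive Queue consumes a Work Request."""
    if operation == "send":
        return True                     # an incoming Send consumes a WR
    if operation == "rdma_read":
        return False                    # an RDMA Read never consumes a WR
    if operation == "rdma_write":
        # Only a successfully completed RDMA Write carrying Immediate
        # Data consumes a WR.
        return has_immediate and completed_ok
    raise ValueError(f"unknown operation: {operation!r}")


print(consumes_receive_wr("rdma_write", has_immediate=True))
print(consumes_receive_wr("rdma_write", has_immediate=False))
```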

The target address of an RDMA request is the remote node's virtual
address, a valid R_Key and length. The R_Key must be associated with
either a Memory Region or a Memory Window containing that virtual address.

Queue Pairs and Memory Regions or Memory Windows have RDMA
Read attributes and RDMA Write attributes. These attributes are checked
at the target end and are not checked at the source end.

C10-88: The CI shall not transfer data from an RDMA operation into the
target memory unless the RDMA operation is enabled for the target QP.

10.7.2.3 Atomic Operations

IB Atomic Operations are architected as an optional feature to enable
high-performance synchronization for distributed applications running on
multiple hosts on the IB fabric.

Two operation types are supported: Compare & Swap and Fetch & Add.
The operand size for these operations is 64 bits. It is the responsibility of
the Channel Interface at the local endnode to do any transformation to
match the endnode endian convention.

O10-40: If the CI supports Atomic operations, the CI shall support two 
types of Atomic operations, Compare & Swap and Fetch & Add. 

O10-41: If the CI supports Atomic operations, the CI at the local endnode 
shall perform any byte ordering transformation required to match the en- 
dian endnode convention. 

O10-42: If the CI supports Atomic operations, the CI shall implement 
Fetch & Add using two's complement arithmetic without saving the carry. 

o10-43: If the CI supports Atomic operations, the CI shall return an error 
if the remote address of the Atomic operation is not aligned on a 64-bit 
boundary. 

It is up to the Consumer to interpret whether the numbers are signed or 
unsigned. 
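
The arithmetic and alignment rules for the two Atomic operations can be
illustrated with the sketch below. Memory is modeled, purely as an
assumption, by a dictionary from byte address to an unsigned 64-bit
value, and a `ValueError` stands in for the error returned on a
misaligned address. Both functions return the original value, which is
how a caller can tell whether a Compare & Swap actually swapped.

```python
MASK64 = (1 << 64) - 1  # operands are 64 bits wide


def _check_aligned(addr):
    if addr % 8:
        raise ValueError(f"address {addr:#x} is not aligned on a 64-bit boundary")


def fetch_and_add(memory, addr, add):
    """Add with two's complement wraparound (carry discarded); return old value."""
    _check_aligned(addr)
    original = memory[addr]
    memory[addr] = (original + add) & MASK64
    return original


def compare_and_swap(memory, addr, compare, swap):
    """Store swap only if the current value equals compare; return old value."""
    _check_aligned(addr)
    original = memory[addr]
    if original == compare:
        memory[addr] = swap & MASK64
    return original


memory = {0x1000: MASK64}                  # all ones: -1 if read as signed
print(fetch_and_add(memory, 0x1000, 1))    # returns the old value
print(memory[0x1000])                      # wrapped around to 0
```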

Atomic Operations are supported only on the two reliable Transport Ser- 
vice Types — Reliable Connection and Reliable Datagram. 

O10-44: If the CI supports Atomic operations, the CI shall support Atomic 
operations for the RC Transport Service Type. 

O10-45: If the CI supports Atomic operations and the RD Transport
Service Type, then the CI shall support Atomic operations for an RD QP.
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o10-46: If the CI supports Atomic operations, the CI shall not support 1 
Atomic operations on any other Transport Service Types other than RC 2 
and RD. 3 

4 

Atomic Operation requests are posted to the Send Queue. The Atomic
Operation request is made using the Post Send Request Verb. The results
are contained in the data segment. The completion status of the request
posted to the Send Queue indicates only if the Atomic Operation was
successfully attempted. The Consumer must check the result to determine
if a conditional operation took place.

O10-47: If the CI supports Atomic operations, the CI shall return the
results of the operation in the Data Segment.

If an HCA supports atomics, then all atomic operation requests made to
that HCA, referencing the same physical memory, are guaranteed to appear
to be serialized with respect to each other. These operations may be
directed at one or more queue pairs.

O10-48: If the CI supports Atomic operations, the CI shall provide the
appearance that all Atomic operation requests made to the same HCA,
referencing the same physical memory, are serialized with respect to each
other.

Atomic operation requests made to an HCA are not guaranteed to be
serialized with respect to RDMA operation requests made to it or other HCAs
in the system, or with respect to operations performed by other system
components such as processors. Because of this behavior, if atomic
operations on a particular area of memory are used to implement locks, all
accesses to that memory must be done using atomic operations. In
particular, it is not safe to use an RDMA read or Send/Receive to see if a lock
is held, and it is not safe to use an RDMA write or Send/Receive to clear
a lock.



Optionally, some systems may choose to provide a stronger guarantee:
that all atomic operation requests made to all HCAs in the system, as well
as all atomic operations performed on memory by other system
components such as processors, referencing the same physical memory, are
guaranteed to appear to be serialized with respect to each other. Again,
these operations may be directed at one or several separate queue pairs.
The definition of an "atomic operation" as performed by a system
component which is not an HCA is implementation-dependent; for instance, a
processor might be required to execute a particular instruction to produce
an atomic operation.

O10-49: If the CI supports Atomic operations and the system provides
Atomic access across the system, the CI shall provide the appearance
that all Atomic operation requests that reference the same physical
memory are serialized with respect to each other.

The Bind Memory Window operation associates a previously allocated
Memory Window to a specified address range within an existing Memory
Region, along with a specified set of remote access privileges.

Bind Operations are supported only on the Reliable Connection,
Unreliable Connection, and Reliable Datagram Service Types.

C10-89: The CI shall support Bind operations for RC and UC Transport
Service Types.

Bind operations must be posted to the Send Queue. Binds affect only local
HCA memory mapping resources and do not cause any packets to be
issued over the link. No resources at the destination QP are affected.

A Work Request contains all of the information required to perform the
requested operation.

The contents of a Work Request for an operation posted to the Send
Queue are described in Section 11.4.1.1 Post Send Request on page 496.

The contents of a Work Request for an operation posted to the Receive
Queue are described in Section 11.4.1.2 Post Receive Request on page
501.


10.7.3.1 Signaled Completions

Work Requests always generate a Work Completion by default. This is
referred to as a Signaled Completion. There is a mechanism where Work
Requests posted to the Send Queue may not generate a Work Completion
in the associated Completion Queue. This is referred to as an
Unsignaled Completion. In order to use Unsignaled Completions, the QP has to
be configured to support Unsignaled Completions and the Work Request
must use the Signaling Indicator to request an Unsignaled Completion.
Note that if a completion error occurs, a Work Completion will always be
generated, even if the signaling indicator requests an Unsignaled
Completion.

C10-90: The CI shall support both signaled and unsignaled completions.

C10-91: The CI shall generate a CQE when a Work Request completed
under any of the following conditions:

• The Work Request completed in error.

• The Work Request was submitted to the Receive Queue.

• The Work Request was submitted to a Send Queue configured for
only Signaled Completions.

• The Work Request was submitted to a Send Queue configured for
Unsignaled Completions but the Work Request requested a Signaled
Completion.

C10-92: The CI shall not generate a CQE when all of the following
conditions have been met for a completed Work Request that was submitted
to the Send Queue:

• The Send Queue has been configured to support Unsignaled
Completions.

• The Work Request submitted to that Send Queue set the Signaling
Indicator to request an Unsignaled Completion.

• That Work Request completed successfully.

Work Requests using Unsignaled Completions can be determined to have
been completed according to the rules in 10.8.6 Unsignaled Completions.

10.7.3.2 Scatter/Gather

A scatter/gather list may contain zero or more Data Segments. The 
buffers specified in a Work Request scatter/gather list must be registered 
with the Channel Interface prior to submission. These buffers must be 
considered to be in the scope of the Channel Interface from the time sub- 
mitted to a work queue until completion of the Work Request has been 
confirmed. See 10.8.5 Returning Completed Work Requests and 10.8.6 
Unsignaled Completions for a full description on when the completion of 
a Work Request is confirmed. 

C10-93: If the total sum of all of the buffer lengths exceeds the maximum
message payload size specified for an RC or UC QP, the CI shall report 
an error. 

O10-51: If the CI supports RD Service, and if the total sum of all of the 
buffer lengths exceeds the maximum message payload size specified for 
an RD QP, the CI shall report an error. 

A Data Segment is defined by a Virtual Address, L_Key and Length. 

C10-94: The CI shall support scatter lists for Receive and RDMA Read
operations.

C10-95: The CI shall support gather lists for Send and RDMA Write
operations.
The order in which the Channel Interface accesses the memory described
by a scatter/gather list is not defined by the architecture. In particular, this
means that after completion of a Work Request whose scatter list contains
overlapping Data Segments, the contents of the overlapped memory are
undefined.
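
As an illustration of the scatter/gather rules in this section, the sketch below models a Data Segment as the (Virtual Address, L_Key, Length) triple, checks the summed buffer length against a maximum message payload size, and detects overlapping segments. The names `DataSegment`, `check_payload` and `has_overlap` are hypothetical, and the overlap test is a simplification that compares virtual address ranges only and assumes all segments belong to the same registered region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSegment:
    virtual_address: int
    l_key: int
    length: int


def check_payload(sg_list, max_message_size):
    """Return the total length, or raise if it exceeds the QP's maximum payload."""
    total = sum(seg.length for seg in sg_list)
    if total > max_message_size:
        raise ValueError(
            f"total buffer length {total} exceeds maximum payload {max_message_size}"
        )
    return total


def has_overlap(sg_list):
    """True if any two non-empty segments cover a common virtual address."""
    spans = sorted(
        (seg.virtual_address, seg.virtual_address + seg.length)
        for seg in sg_list
        if seg.length > 0
    )
    # After sorting by start address, any overlap shows up between neighbours.
    return any(prev_end > start for (_, prev_end), (start, _) in zip(spans, spans[1:]))
```

A Consumer could use a check like `has_overlap` before posting a scatter list, since the contents of overlapped memory are undefined once the Work Request completes.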



10.8 Work Request Processing Model

10.8.1 Overview

The Work Request processing model describes how requests are
submitted, processed by the HCA, and the results returned to the Consumer.

10.8.2 Submitting Work Requests to a Work Queue

Work Requests are submitted to the HCA through the Verbs abstraction.

Work Queue Elements are abstract. This means that they are not
accessible directly by the Consumer of the Channel Interface.

The intent of the architecture is to allow an implementation to pass Work
Requests from a User-level Consumer process to the HCA without kernel
involvement.

The QP can accept Work Requests only when the QPs are in states that
allow them to be submitted. The rules are as follows:

C10-96: The QP shall process Work Requests submitted to the Send
Queue as described in the rules that follow:

• Return an immediate error if the QP is in the Reset, Init and RTR
states.

• Are processed when the QP is in the RTS state.

• Are completed in error, assuming that processing is able to continue
when the QP is in the SQEr or Error state.

• Are enqueued but not processed when the QP is in the SQD state.

C10-97: The QP shall process Work Requests submitted to the Receive
Queue as described in the rules that follow:

• Return an immediate error if the QP is in the Reset state.

• Are accepted, but incoming messages are not processed when the
QP is in the Init state.

• Are processed when incoming messages arrive and the QP is in the
RTR, RTS, or SQD state.

• Are completed in error, assuming that processing is able to continue
when the QP is in the SQEr or Error state.
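
The per-state rules for the Send Queue and the Receive Queue can be summarized as a lookup table. The following is a minimal sketch; the state names follow the text, while the disposition strings and the `disposition` function are hypothetical labels chosen for illustration.

```python
# What happens to a Work Request submitted to the Send Queue, by QP state.
SEND_QUEUE_RULES = {
    "Reset": "immediate_error",
    "Init": "immediate_error",
    "RTR": "immediate_error",
    "RTS": "processed",
    "SQD": "enqueued_not_processed",
    "SQEr": "completed_in_error",
    "Error": "completed_in_error",
}

# What happens to a Work Request submitted to the Receive Queue, by QP state.
RECEIVE_QUEUE_RULES = {
    "Reset": "immediate_error",
    "Init": "accepted_not_processed",
    "RTR": "processed",
    "RTS": "processed",
    "SQD": "processed",
    "SQEr": "completed_in_error",
    "Error": "completed_in_error",
}


def disposition(queue, qp_state):
    """Look up how a submitted Work Request is handled for the given queue and state."""
    rules = {"send": SEND_QUEUE_RULES, "receive": RECEIVE_QUEUE_RULES}[queue]
    return rules[qp_state]
```

For example, `disposition("receive", "Init")` reports that receive Work Requests may be posted early, before the QP can process incoming messages.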

The modifiers in the Work Request are instantiated into the next free WQE
in the specified Work Queue and the CI is informed that a new WQE has
been added to the queue.

Figure 118 shows the transformation of a Work Request into a WQE to be
processed by the HCA.

[Figure: Work Requests, Abstracted Work Queue, HCA Hardware]

Figure 118 Work Queue Abstraction

10.8.3 Work Request Processing

Processing of Work Requests submitted to a Work Queue is initiated in
the order submitted. There is no ordering between WRs submitted to the
send queue and WRs submitted to the receive queue. Send WRs are
initiated in the same order they were passed to the Verbs layer with respect
to other send WRs submitted to the same send queue. Likewise, receive
WRs are initiated in the same order they were passed to the Verbs layer
with respect to other receive WRs submitted to the same receive queue.

C10-98: The CI shall initiate Work Requests submitted to a single queue
in the order in which those Work Requests were submitted to that queue.

C10-99: For all Service types except RD, Work Requests submitted to the
same Receive Queue shall complete in the same order in which they
were submitted.

Resources associated with a Work Request must be considered to be in
the scope of the Channel Interface from the time the Work Request is
submitted to a Work Queue until the completion for that Work Request has
been confirmed. See 10.8.5 Returning Completed Work Requests and
10.8.6 Unsignaled Completions for a description of when a Work Request
completion is confirmed.

Work Requests submitted to a single Work Queue complete in the same
order as the requests were submitted, according to the Ordering Rules.

The exception to this rule is that reliable datagrams are permitted to
complete out of order on the Receive Queue.

Reliable datagrams originating from a specific Send Queue complete
in the same order they were submitted when they are sent to the
same Receive Queue.

O10-52: If the CI supports RD Service, Work Requests submitted to the
same RD Send Queue shall complete in the same order in which they
were submitted.

O10-53: If the CI supports RD Service, Work Requests submitted to the
same RD Receive Queue shall complete in the same order in which they
were submitted when the requests originate from the same remote RD
Send Queue.

Receive completions from reliable datagrams sent from multiple Send
Queues are allowed to be interleaved on the Receive Queue.



As shown in Table 67: Work Request Operation Ordering, ordering
semantics for WRs submitted to the Send Queue vary according to the
operation type. Some operations can begin processing within the CI while
other operations are still outstanding, potentially yielding out-of-order
semantics for certain operation sequences. For cases enumerated below,
in-order semantics can be guaranteed by setting the Fence Indicator for
appropriate WRs. When the Fence Indicator is set for a given WR, that
WR cannot begin to be processed until all prior RDMA Read and Atomic
operations on the same Send Queue have completed.

C10-100: When the Fence Indicator has been set in a Work Request, the
Send Queue shall not begin processing that Work Request until all prior
RDMA Read and Atomic Operations on that Send Queue have
completed.

Here are the cases where the Fence Indicator can be used to guarantee
in-order semantics:

• An RDMA Read won't necessarily complete before subsequent
Sends, RDMA Writes, or Atomics are initiated and observed by the
target. If the target Consumer then modifies memory locations being
returned by the RDMA read, the RDMA read could return the newly
modified data instead of the original data. Setting the Fence Indicator
for the subsequent operation in each case guarantees that the
operation will not be observed by the target until all prior RDMA Reads
complete.

• An RDMA Read can return data that's been modified by subsequent
Sends, RDMA Writes, or Atomics if they target memory locations
being returned by the RDMA Read. Setting the Fence Indicator on the
subsequent operation in each case guarantees that the operation will
not affect data being returned by a prior RDMA read.

• RDMA Read or Atomic operations won't necessarily complete before
subsequent Sends, RDMA Writes, or Atomics are initiated and
observed by the target. If one of the former operations completes in
error on the initiator side because its ACKs fail to return successfully,
the subsequent operation could still be observed by the target, and
the target Consumer might take some undesired action. Setting the
Fence Indicator on the subsequent operation in each case
guarantees that it can't be observed by the target unless all prior RDMA
Reads and Atomics complete successfully on the initiator side.
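
The effect of the Fence Indicator can be sketched as a start-eligibility check on a Send Queue. This is a simplified model under stated assumptions: it captures only the Fence rule (a fenced Work Request waits for all prior RDMA Read and Atomic operations on the same Send Queue) and ignores every other ordering rule, such as the Bind rule. The `WorkRequest` class and `can_begin` function are hypothetical.

```python
from dataclasses import dataclass

# Operation types whose completion a fenced Work Request must wait for.
FENCED_PRIOR_OPS = {"RDMA Read", "Atomic"}


@dataclass
class WorkRequest:
    op: str
    fence: bool = False
    completed: bool = False


def can_begin(send_queue, index):
    """True if the Work Request at `index` is allowed to begin processing."""
    wr = send_queue[index]
    if not wr.fence:
        return True  # without the Fence Indicator this model imposes no wait
    return all(
        prior.completed
        for prior in send_queue[:index]
        if prior.op in FENCED_PRIOR_OPS
    )
```

In this model a fenced Send posted behind an outstanding RDMA Read is held back until the read completes, while an unfenced Send in the same position may begin at once.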

The Bind operation has a unique ordering rule: any Work Request posted
to a Send Queue subsequent to a Bind must not begin execution until the
Bind operation completes. However, note that a Bind operation itself can
begin execution in some cases before prior operations have necessarily
completed.

Ordering guarantees for processing and completion notifications exist
only between Work Requests submitted to the same queue. The ordering
across multiple Work Queues is undefined.

C10-101: The CI shall provide the guarantees for processing and
completion notifications between Work Requests submitted to the same Send
Queue as specified by the ordering rules in Table 67.

Ordering Rules:

• Receive Queues are FIFO queues with the exception of the reliable
datagram issue described above.

• Send Queues are FIFO queues, according to the rules in Table 67:
Work Request Operation Ordering. The Fence Indicator can be used
to require strict ordering.

Table 67: Work Request Operation Ordering

                              Second Operation
First Operation   Send   Bind Window   RDMA Write   RDMA Read   Atomic Op
Send               #         #             #            #           #
Bind Window        F         #             #            #           #
RDMA Write         #         #             #            #           #
RDMA Read          F         F             F            #           F
Atomic Op          F         F             F            #           F

Table 69: Ordering Rules Key

Symbol   Description
#        Order is always maintained.
F        Order maintained only if second operation has Fence Indicator set.


10.8.4 Completion Processing 



The results from a Work Request operation are placed in a Completion 
Queue Entry (CQE) on the CQ associated with the Work Queue when the 
request has completed. 

A CQE must be generated before a Work Completion can be returned to 
the Consumer. Note that not all Work Requests will generate a comple- 
tion, due to unsignaled completions. The rules for when a CQE is gener- 
ated are outlined in 10.8.5 Returning Completed Work Requests . 

C10-102: For completed Work Requests that generate a Work
Completion, the CI shall place that Work Completion on the CQ associated with
the Work Queue.

A CQE is an internal representation of the Work Completion. 
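
The conditions under which a CQE is generated, listed in 10.7.3.1 Signaled Completions, reduce to a short predicate. The sketch below is one way to express them; the function name and its parameters are hypothetical and do not correspond to Verb modifiers.

```python
def generates_cqe(queue, completed_in_error,
                  sq_unsignaled_enabled=False, wr_requests_unsignaled=False):
    """Decide whether a completed Work Request produces a CQE.

    queue is "send" or "receive". The two keyword flags only matter for
    Work Requests submitted to a Send Queue.
    """
    if completed_in_error:
        return True  # errors always generate a completion
    if queue == "receive":
        return True  # Receive Queue Work Requests always generate one
    if not sq_unsignaled_enabled:
        return True  # Send Queue configured for only Signaled Completions
    # Send Queue supports Unsignaled Completions: follow the Signaling Indicator.
    return not wr_requests_unsignaled
```

Only one combination suppresses the CQE: a successful Send Queue Work Request that asked for an Unsignaled Completion on a Send Queue configured to support them.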





InfiniBand™ Trade Association

Page 424

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067

InfiniBand™ Architecture Release 1.0 Software Transport Interface October 24, 2000

Volume 1 - General Specifications FINAL

10.8.5 Returning Completed Work Requests

All completions are abstracted through the Verbs. The only method of
retrieving a Work Completion is through the Verbs.

Completions are always returned in the order submitted to a given work
queue with respect to other Work Requests on that work queue. Ordering
rules of completion entries from multiple work queues associated with a
given completion queue are not mandated by this specification.

A retrieved Work Completion is no longer in the domain of the Channel
Interface. Therefore, a Work Completion can only be retrieved once.

C10-103: The CI shall not allow a specific Work Completion to be
retrieved more than once.

The Work Completion contents are specified in 11.4.2.1 Poll for
Completion on page 502.

A Consumer can find out when a Work Completion can be retrieved
through polling or notification.

C10-104: The CI shall return a Work Completion for a Work Request that
completed with a signaled completion.

C10-105: The CI shall return a Work Completion for a Work Request
submitted to a Send Queue that completed in error.

C10-106: The CI shall return a Work Completion for the completion of a
Work Request submitted to a Receive Queue.

A Work Request is confirmed when the associated Work Completion is
retrieved from its CQ.

C10-107: The CI shall not access any buffers associated with the Work
Request once the associated Work Completion has been retrieved.

10.8.5.1 Freed Resource Count

One of the modifiers returned with the completion is a count that informs
the Consumer of the number of work request resources freed by this
completion. This applies only to Reliable Datagram Receive Queues. Work
request resources refers to Channel Interface resources allocated on
behalf of the Consumer, such as available WQEs for a given Work Queue,
and not direct Consumer resources, such as buffers.

If this count is zero, this indicates that no receive queue work queue
elements have been freed when this Work Completion was generated. If this
count is greater than zero, the Consumer can assume that the counter
indicates the number of work requests released from the RD RQ. This is
useful for the Consumer to keep track of the number of available work
requests which can be outstanding.

Buffers associated with the outstanding work request associated with this
work completion are no longer considered to be in the scope of the HCA,
regardless of the Freed Resource Count.

For most implementations, this count is expected to be one with every
work completion.

O10-54: If the CI supports RD Service, when a Work Completion
associated with a Work Request posted to an RD RQ is retrieved, the CI shall
return a count of the number of Work Request resources freed through the
Verbs.


10.8.6 Unsignaled Completions

An unsignaled Work Request that completed successfully is confirmed
when all of the following rules are met:

• A Work Completion is retrieved from the same CQ that is
associated with the Send Queue to which the unsignaled Work Request
was submitted.

• That Work Completion corresponds to a subsequent Work
Request on the same Send Queue as the unsignaled Work Request.

C10-108: The CI shall not access buffers associated with an Unsignaled
Work Request once a Work Completion has been retrieved that
corresponds to a subsequent Work Request on the same Send Queue.

10.8.7 Asynchronous Completion Notification

The Consumer may register a completion notification routine to be called
when a new entry is added to the CQ using the Set Completion Event
Handler Verb.

C10-109: A CI shall support registering a single CQ Event Handler per
HCA.

C10-110: The CI shall replace any previous handler with the handler
specified in a new Request Completion Event Verb invocation.
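
The notification model of this section (a registered handler, a one-shot notification request per CQ, and the poll, request, poll sequence recommended for Request Completion Notification) can be sketched with a small in-memory model. The `CompletionQueue` class and the `drain_and_arm` helper are hypothetical. For brevity the handler is attached to the CQ object here, whereas the specification registers a single handler per HCA.

```python
from collections import deque


class CompletionQueue:
    def __init__(self, handler=None):
        self.entries = deque()
        self.handler = handler  # called with the CQ that generated the event
        self.armed = False      # at most one notification request outstanding

    def request_notification(self):
        # Additional requests while already armed have no effect.
        self.armed = True

    def add_entry(self, work_completion):
        self.entries.append(work_completion)
        if self.armed and self.handler is not None:
            self.armed = False  # one-shot: the Consumer must request again
            self.handler(self)

    def poll(self):
        return self.entries.popleft() if self.entries else None


def drain(cq):
    """Poll until the CQ is empty and return what was retrieved."""
    completions = []
    while (wc := cq.poll()) is not None:
        completions.append(wc)
    return completions


def drain_and_arm(cq):
    completions = drain(cq)    # 1) dequeue existing entries
    cq.request_notification()  # 2) request a notification for the next entry
    completions += drain(cq)   # 3) pick up entries added in between
    return completions
```

Entries already on the CQ when the request is made do not trigger the handler in this model, which is why the final poll in `drain_and_arm` is needed.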

The Request Completion Event Verb is set on a CQ basis. This is a
one-shot notification; at most, one notification will be generated per call to this
Verb. Once CQ notifications have been enabled, additional Request
Completion Event calls have no effect. The handler will be called once when
the next entry is added to the CQ specified as a modifier to this Verb. The
presence of Solicited Events may impact this behavior. See 11.4.2.2
Request Completion Notification on page 506 and 9.2.3 Solicited Event (SE) -
1 bit on page 202 for details.

C10-111: A CQ shall have at most one Completion Event notification
request outstanding.

C10-112: A CI shall generate a single Completion Event when a Work
Completion that satisfies the outstanding Completion Event request is
added to the CQ.

C10-113: A CI shall not generate a Completion Event for existing Work
Completion entries on the specified CQ at the time the completion
notification request is registered.

A notification will not be generated until the next entry is added to the CQ.

The following sequence of calls should be used when using Request
Completion Notification in order to ensure that a new CQ entry is not
missed for the specified CQ.

1) Poll for Completion to dequeue existing CQ entries.

2) Request Completion Notification.

3) Poll for Completion to pick up any CQ entries that were added
between the time the first Poll for Completion was called and the
notification is enabled.

If a handler has not been registered, a notification will not be generated.

When the handler routine is invoked, an indication of which CQ has
generated the completion notification will be supplied. Once the handler
routine has been invoked, the Consumer must call Request Completion
Notification again to be notified when a new entry is added to the CQ.

C10-114: For each Completion Event, the CI shall indicate which CQ
caused the generation of that event.

The Consumer is responsible for polling the CQ to retrieve the work
completion. This function is not performed automatically when the notification
occurs.

10.9 Partitioning

This section discusses InfiniBand™ support for partitioning of an Infini- 
Band™ network. The Verb support for partitioning is contained in the 
Verbs that perform Queue Pair management, read Channel Interface (CI) 
content, and set it. These are documented in 11.2 Transport Resource 
Management on page 449 . 






10.9.1 Introduction 



In this discussion, the term "Partition Manager" (PM) refers to the function
of the Subnet Manager that deals with partitioning for the CI being
discussed; see 13.5 MAD Processing on page 599 for how SMPs are
directed to that manager.



Partitioning enforces isolation among systems sharing an InfiniBand™ 
fabric by requiring that packets contain a 16-bit Partition Key (P_Key) 
which must match a P_Key stored at the receiver or be discarded; see 
definition of "match" below ( 10.9.3 Partition Key Matching ). There are no 
Verbs that directly set the P_Keys sent or matched against in a CA. Verbs 
instead specify an index into a table of P_Keys: the P_Key_ix, specifying 
an entry in the P_Key Table. The contents of the P_Key Table are con- 
trolled by the subnet's Partition Manager (PM), which sets them using 
Subnet Management Packets (SMPs) sent through the subnet's Subnet 
Manager. 

Subsections appearing below describe the P_Key Table, the matching 
process, and the way P_Keys are attached to packets. See 14.2.5 At- 
tributes on page 626 for a description of the SMPs which set entries in the 
P_Key Table. 



10.9.1.1 Limited and Full Membership

A collection of endnodes with the same P_Key in their P_Key Tables are
referred to as being members of a partition, or in a partition. A P_Key
Table can specify one of two types of partition membership: Limited or
Full. The high-order bit of the partition key is used to record the type of
membership in a partition table: 0 for Limited, and 1 for Full. Limited
members cannot accept information from other Limited members, but
communication is allowed between every other combination of membership
types.

10.9.1.2 Special P_Keys

There are P_Keys that have special meaning: the default partition key, 
and the invalid partition keys. 

C10-115: The P_Key value 0xFFFF shall represent the default partition
key.

The PM should not use this P_Key value for any other purpose. The de- 
fault partition key provides Full membership in the default partition. 

C10-116: The CI shall regard a P_Key as invalid if its low-order 15 bits
are all zero. The CI shall mark a table entry as invalid by filling it with an
invalid P_Key.

C10-117: The PM must not use these two P_Key values for any other
purposes.

Any P_Key which is not invalid is referred to as valid. The default partition
key is valid. A P_Key Table entry containing a valid P_Key is referred to
as a valid P_Key Table entry.


10.9.1.3 Operation Across Subnets

C10-118: Switches or Routers shall not modify P_Key values when
packets are forwarded/routed within or between subnets.

C10-119: A packet's P_Key must match a P_Key stored at the destination
CI or CA, or the packet shall be discarded; see the definition of "match"
below (10.9.3 Partition Key Matching).

In the above case a P_Key sourced in one subnet must be valid in another
subnet. Since subnets may have different PMs, this must be arranged to
happen, for example by human administration (analogous to assignment
of static IP addresses) or by a program dialog between subnets' PMs. The
definition of the messages used in such an inter-PM dialog is beyond the
scope of this version of the specification.

10.9.2 The Partition Key Table (P_Key Table)


C10-120: Each HCA port and switch SMA port shall contain a Partition
Key Table (P_Key Table). The valid entries in the P_Key Table shall hold
P_Keys for all the endnodes with which this CI can communicate.

If a switch or router supports the optional P_Key Enforcement feature,
then each of its ports shall contain a Partition Key Table (P_Key Table).

C10-121: The P_Key Table size, meaning the maximum number of
entries it can hold, must be greater than or equal to one and less than or
equal to 65535.

The maximum number of entries that can be held in a P_Key Table can
be obtained by using the Query HCA Verb or the NodeInfo SMP. (See
11.2.1.2 Query HCA on page 449 and 14.2.5.3 NodeInfo on page 630.)

C10-122: The CI must not provide any interface which allows software
above the Verbs to alter the P_Key Table contents or change the validity
of any entry in the P_Key Table, except through the use of SMPs.

Verbs allow host software to read entries in the P_Key Table. If the value
read is an invalid partition key value, that entry is invalid.




SMPs sent to the endnode are used to read and write entries in the P_Key
Table. The operations involved when a table is written are described in a
later section.

Also see 9.6.1.1.3 BTH:P_Key on page 232 and 18.2.1 Attributes on page
814.

C10-123: If non-volatile storage is not used to hold P_Key Table contents,
then if a PM (Partition Manager) is not present, and prior to PM
initialization of the P_Key Table, the P_Key Table must act as if it contains a single
valid entry, at P_Key_ix = 0, containing the default partition key. All other
entries in the P_Key Table must be invalid.

10.9.3 Partition Key Matching

C10-124: The P_Key field of incoming packets received by an endnode
shall be matched against a resident P_Key as described in the remainder
of this section.

In the following, let M_P_Key (Message P_Key) be the P_Key in the
incoming packet and E_P_Key (Endnode P_Key) be the P_Key it is being
compared against in the packet's destination endnode.

If:

• neither M_P_Key nor E_P_Key are the invalid P_Key,

• and the low-order 15 bits of the M_P_Key match the low order 15
bits of the E_P_Key;

• and the high order bit (membership type) of both the M_P_Key
and E_P_Key are not both 0 (i.e., both are not Limited members
of the partition)

then the P_Keys are said to match. In this case the incoming packet
is accepted and processed normally.

In all other cases the P_Keys are said to not match. The incoming
packet must be treated as if it was sent to a nonexistent device,
meaning:

• no ACK is returned

• optionally, a trap SMP is sent to the SM and a counter is
incremented; see 10.9.4 Bad P_Key Trap and P_Key Violations
Counter (Optional)

• there is no other effect on the target endnode.
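
The matching rule above is pure bit arithmetic on 16-bit values, so it can be written down directly. The following is a minimal sketch; the function names are hypothetical, while the bit layout (high-order bit for membership type, low-order 15 bits for the key, all-zero low bits meaning invalid, 0xFFFF as the default partition key) comes from this section.

```python
DEFAULT_P_KEY = 0xFFFF  # default partition key, Full membership
KEY_BITS = 0x7FFF       # low-order 15 bits
MEMBERSHIP_BIT = 0x8000 # high-order bit: 1 = Full, 0 = Limited


def is_invalid(p_key):
    """A P_Key is invalid if its low-order 15 bits are all zero."""
    return (p_key & KEY_BITS) == 0


def is_full_member(p_key):
    return bool(p_key & MEMBERSHIP_BIT)


def p_keys_match(m_p_key, e_p_key):
    """Apply the three matching conditions to a message and an endnode P_Key."""
    if is_invalid(m_p_key) or is_invalid(e_p_key):
        return False
    if (m_p_key & KEY_BITS) != (e_p_key & KEY_BITS):
        return False
    # Two Limited members of the same partition do not match.
    return is_full_member(m_p_key) or is_full_member(e_p_key)
```

A receiver would discard the packet without returning an ACK whenever `p_keys_match` is false for the P_Key it compares against.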

10.9.4 Bad P_Key Trap and P_Key Violations Counter (Optional)

O10-55: If the CA supports the trap SMP for P_Key Violations, then if a
packet's P_Key does not match, the destination node shall send a trap
SMP to the SM, specifying the partitioning class and the Bad P_Key
Notification method. The body of the trap SMP must contain the header(s)
of the offending packet. Like all traps, this one shall not be sent at a
frequency faster than the Subnet Timeout.

O10-56: If the CA supports the trap SMP for P_Key Violations, then if
another P_Key mismatch occurs before the trap can be sent, the data for the
new mismatch shall replace the previously stored data.

O10-57: If the CA supports a P_Key Violations counter, then it shall have
the following characteristics:

• Its minimum size is one bit; its maximum size is 16 bits (unsigned).

• It is incremented whenever the P_Key on a message arriving on a
given port does not match (as described in 10.9.3 Partition Key
Matching).

• When its value reaches all 1s, further incrementing does not change
its value: i.e., it saturates.

• It is initialized by power on reset to zero.

The P_Key Violations counter can be read and set by using a SMP that 18 
accesses P_keyViolations component of the Portlnfo attribute; see ^ 9 

14.2.5.1 Notice on page 628 . 20 
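
The counter characteristics listed in O10-57 can be illustrated with a short Python sketch. It is a model under stated assumptions, not a definitive implementation: the class name `PKeyViolationsCounter` and its methods are hypothetical, and truncating a written value to the counter width is an assumption of the sketch rather than something this section specifies.

```python
class PKeyViolationsCounter:
    """Saturating unsigned counter, between 1 and 16 bits wide."""

    def __init__(self, width_bits: int = 16) -> None:
        if not 1 <= width_bits <= 16:
            raise ValueError("width must be between 1 and 16 bits")
        self.max_value = (1 << width_bits) - 1  # the all-1s value
        self.value = 0  # power-on reset value

    def record_violation(self) -> None:
        """Count one P_Key mismatch; stay at all 1s once saturated."""
        if self.value < self.max_value:
            self.value += 1

    def set(self, new_value: int) -> None:
        """Model an SMP write (truncation to width is an assumption)."""
        self.value = new_value & self.max_value
```

A 2-bit counter that sees ten mismatches ends at 3 and stays there until it is set again.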

C10-125: Except for the subnet management QP (QP0) and QPs
providing RD (Reliable Datagram) or Raw Datagram service, a P_Key must
be associated with each QP before the QP is used. If a CI has multiple
ports, the P_Key Table to which the P_Key index refers shall be the
P_Key Table of the port that the QP is currently using.

This association is done through Verbs that specify the P_Key_ix of the
key to use.

C10-126: The CI shall attach a QP's P_Key to all packets sent from the
QP's send queue, except for SMPs, raw datagram packets and packets
sent from RD QPs.

SMPs are always sent with the default P_Key, raw datagram packets do
not contain a P_Key, and packets from an RD QP get their P_Keys from
the EE context associated with the RD QP.
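
The outgoing rule in C10-126 and its three exceptions can be sketched as a small selection function. This is an illustration only. The enum `PacketKind` and the function `outgoing_p_key` are hypothetical names, and the default P_Key value 0xFFFF is an assumption of the sketch, since this section does not state the value.

```python
from enum import Enum, auto
from typing import Optional

DEFAULT_P_KEY = 0xFFFF  # assumption: the value is not given in this section


class PacketKind(Enum):
    SMP = auto()
    RAW_DATAGRAM = auto()
    RD = auto()
    OTHER = auto()


def outgoing_p_key(kind: PacketKind, qp_p_key: int,
                   ee_p_key: Optional[int] = None) -> Optional[int]:
    """Pick the P_Key an outgoing packet carries, or None if it has none."""
    if kind is PacketKind.SMP:
        return DEFAULT_P_KEY          # SMPs use the default P_Key
    if kind is PacketKind.RAW_DATAGRAM:
        return None                   # raw datagrams contain no P_Key
    if kind is PacketKind.RD:
        if ee_p_key is None:
            raise ValueError("RD packets need the EE context's P_Key")
        return ee_p_key               # RD takes the key from the EE context
    return qp_p_key                   # all other packets use the QP's key
```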

C10-127: The CI shall compare the QP's P_Key to the P_Key contained
in all incoming packets, except for raw packets and packets destined for
QP0, QP1, and QPs providing RD service.

The comparison is described in 10.9.3 Partition Key Matching. The
exceptions to this are described in 10.9.8 Partition Enforcement on
Management Queue Pairs and 10.9.5.1 EE Context (Reliable Datagram) Support.

O10-58: If the CI supports the RD Service, then it must associate a P_Key
with each EE Context before the EE Context is used. If a CI has multiple
ports, the P_Key Table to which the P_Key index refers shall be the
P_Key Table of the port that the EE Context is currently using.

O10-59: If the CI supports the RD Service, then the CI must attach an EE
Context's P_Key to all outgoing Reliable Datagram (RD) packets emitted
using that EE Context. All incoming packets using a given EE Context
shall be compared with that EE Context's P_Key as described in 10.9.3
Partition Key Matching.

As stated in that section: if the P_Keys match, the packet is processed
normally; otherwise it is silently discarded and, optionally, a trap is issued
and the Bad P_Key Counter is incremented as described in that section.

RD service is not used on management queue pairs, so this EE Context
support does not apply to them.

10.9.5.2 Partition Key Changes

C10-128: When the PM sends a message to a CI port requesting a
change to the value of a P_Key Table element, the CI must return a
response message indicating that the action has either been carried out
successfully or not performed for some reason.

C10-129: The CI shall guarantee that, after the point in time when it
sends a response message to the PM indicating success, the updated
P_Key Table values will be used to process all subsequent incoming and
outgoing packets traversing the associated port.

This behavior may have begun prior to the PM's receiving the success
reply.

C10-130: TCA support for partitioning must be the same as that for CIs,
with the exception that association of a P_Key with a queue shall be done
in response to messages that initiate creation of queue pairs, as part of
establishing communication with another endnode.

In all other respects, the TCA behaves exactly like a CI in terms of multiple
ports, incoming packets, outgoing packets, and changes to the P_Key
Table.

10.9.7 Fabric Partition Support

The switches in the InfiniBand™ fabric may optionally also enforce
partitioning. How P_Keys are loaded into switches and how they are used is
described in several sections of the chapter describing switches
(18.2.4.2.1 Inbound P_Key Enforcement on page 818 and 18.2.4.4.1
Outbound P_Key Enforcement on page 826).

10.9.8 Partition Enforcement on Management Queue Pairs

The two types of management queues each treat partition enforcement in
a different way.

C10-131: Packets sent to the Subnet Management Interface QP shall
always be accepted, regardless of the P_Key contained in the packet.

Isolation and security of management communication are not provided by
partitioning, but instead by checking of the Management Key.

Packets sent from a Subnet Management Interface QP may have any
P_Key; the default P_Key is used by convention, as described in the
management sections.

C10-132: Packets sent to the General Service Interface QP (QP1) shall
be accepted if the P_Key in the packet matches any valid P_Key in the
P_Key Table of the port on which the packet arrived. Matching is defined
in 10.9.3 Partition Key Matching.

As stated in that section: if the P_Keys match, the packet is processed
normally; otherwise it is silently discarded and, optionally, a trap is issued
and the Bad P_Key Counter is incremented as described in that section.
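
The two acceptance rules for management queue pairs (C10-131 and C10-132) can be contrasted in a brief Python sketch. It is a sketch under assumptions: the function names are hypothetical, and, as before, a P_Key whose low-order 15 bits are zero is assumed to be the invalid P_Key.

```python
from typing import Iterable

MEMBERSHIP_BIT = 0x8000
LOW_15_MASK = 0x7FFF
SMI_QP = 0  # Subnet Management Interface QP (QP0)
GSI_QP = 1  # General Service Interface QP (QP1)


def p_keys_match(m_p_key: int, e_p_key: int) -> bool:
    """Matching as in 10.9.3; zero low-order bits are assumed invalid."""
    if (m_p_key & LOW_15_MASK) == 0 or (e_p_key & LOW_15_MASK) == 0:
        return False
    if (m_p_key & LOW_15_MASK) != (e_p_key & LOW_15_MASK):
        return False
    return bool((m_p_key | e_p_key) & MEMBERSHIP_BIT)


def management_qp_accepts(dest_qp: int, packet_p_key: int,
                          port_p_key_table: Iterable[int]) -> bool:
    """Decide whether a packet sent to QP0 or QP1 passes partition checks."""
    if dest_qp == SMI_QP:
        return True  # accepted regardless of the P_Key in the packet
    if dest_qp == GSI_QP:
        # Accepted if the packet's key matches any entry of the port's table.
        return any(p_keys_match(packet_p_key, entry)
                   for entry in port_p_key_table)
    raise ValueError("not a management QP")
```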

C10-133: Packets sent from the Send Queue of a GSI QP shall attach a
P_Key associated with that QP, just as a P_Key is associated with
non-management QPs.

C10-134: Each switch shall also check P_Keys on its GSI QP. Switches
shall support a P_Key table with at least one entry against which the
P_Key of packets destined for the switch's GSI shall be matched,
according to the rules stated in C10-132 above.

10.9.9 Related Enforcement of Management Message Checking

Checking of the M_Key (see 14.2.4 Management Key on page 623) can
optionally be used to prevent anything but an authorized subnet manager
from reading any SM data from the SMI; when the protection test fails,
the packet that failed is silently discarded. Similarly preventing the writing of
SM data through the SMI, with silent discard, is mandatory.

In addition, it is an option to store the M_Key(s), the M_KeyProtectBits
which control M_Key checking, and the lease period across power cycles
(see Table 127 PortInfo on page 634). Thus, for example, system initialization
techniques cannot assume that a constant default value for that data is
present except for first-power-on from the factory.

10.10 Error Handling Semantics and Mechanisms

10.10.1 Error Types

Three classes of errors reported through the Verbs have been defined:
immediate errors, completion errors and asynchronous errors. Each of these
error classes is described in more detail under its respective heading
within 10.10.2 Error Handling Mechanisms. A brief description of each
error class follows.

Immediate errors are returned as status from the Verbs.

Completion errors are returned to the Verbs Consumer as status within a
Work Completion.

Asynchronous errors are returned through an event handling mechanism.

10.10.2 Error Handling Mechanisms

This section describes the mechanisms used to notify the Verb Consumer
of errors in the requested operations.

10.10.2.1 Immediate Errors

C10-135: The CI shall return Immediate errors upon return of control from
the Verb to the Consumer.

The details of these error types are included with each Verb described in
the Verbs chapter.

C10-136: If an immediate error is returned from a Verb involved in posting
Work Requests to a queue, the CI shall ensure that the Work Request
has not been posted to the queue.
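
The guarantee that a Work Request rejected with an immediate error is never posted can be illustrated as follows. This is a minimal sketch, not an implementation of any real Verb: the class `WorkQueue`, the status strings and the two checks shown are hypothetical and stand in for whatever checks a CI actually performs.

```python
from collections import deque


class WorkQueue:
    """Toy work queue that validates a request before enqueueing it."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = deque()

    def post(self, work_request: dict) -> str:
        """Return a status; on any immediate error the queue is untouched."""
        # Every check runs before the queue is modified, so a failure
        # leaves the queue exactly as it was.
        if len(self.entries) >= self.capacity:
            return "QUEUE_FULL"        # hypothetical status name
        if work_request.get("length", 0) < 0:
            return "INVALID_LENGTH"    # hypothetical status name
        self.entries.append(work_request)
        return "OK"
```

Because the status is available as soon as `post` returns, the caller learns of the failure upon return of control, and nothing about the failed request later appears as a completion.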

10.10.2.2 Completion Errors

C10-137: A Work Request or WQE that is "completed in error" shall have
the appropriate completion error returned in the Work Completion status.

The complete list of errors that can be returned in the Work Completion
status is described in the Verbs chapter under the Completion Queue
Operations (11.4.2.1 Poll for Completion on page 502).

There are two classes of completion errors: interface checks and
processing errors. An interface check is an error in the information supplied
to the Channel Interface detected before data is placed onto the link. A
processing error is an error encountered during the processing of the work
request by the Channel Interface.

10.10.2.3 Asynchronous Errors

Consumers are notified about asynchronous errors through an
asynchronous notification mechanism. In order to be notified when asynchronous
errors occur, the Consumer must register a handler using the Set
Asynchronous Event Handler Verb.

C10-138: After the asynchronous event handler is registered, all
subsequent asynchronous errors shall result in a call to the error handler.
Asynchronous errors that occur before the error handler is registered shall be
lost.

C10-139: Only one error handler shall be registered per HCA.
Subsequent calls to the Set Asynchronous Event Handler Verb shall cause the
previous handler address to be overwritten with the new handler address.

There are two Asynchronous error types:

• Affiliated Asynchronous Error. Related to a specific WQ, CQ or
EE context and unable to report the error in a completion. The QP
or EE context is transitioned to the Error State.

• Unaffiliated Asynchronous Error. Not related to any specific WQ
or CQ.

C10-140: Unaffiliated Asynchronous Errors are handled according to
type: local catastrophic errors shall place all QP/EEs in the Error State;
local port errors shall have no effect on QP/EE State.

The details of these errors are discussed in 11.6.3.2 Affiliated
Asynchronous Errors on page 514 and 11.6.3.3 Unaffiliated Asynchronous Errors
on page 515.
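
The handler registration rules for asynchronous errors (one handler per HCA, later registrations overwrite earlier ones, and errors raised before registration are lost) can be modelled in a few lines. This is a hedged sketch: the class `HcaModel` and its method names are hypothetical and are not the Verb interface itself.

```python
from typing import Callable, Optional


class HcaModel:
    """Toy model of per-HCA asynchronous error handler registration."""

    def __init__(self) -> None:
        self._handler: Optional[Callable[[str], None]] = None

    def set_async_event_handler(self, handler: Callable[[str], None]) -> None:
        # Only one handler exists per HCA; a new one replaces the old one.
        self._handler = handler

    def report_async_error(self, error: str) -> bool:
        """Deliver an error; return False if it was lost (no handler yet)."""
        if self._handler is None:
            return False
        self._handler(error)
        return True
```

An error reported before any handler is registered is dropped, and after a second registration only the second handler is called.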

10.10.3 Effects of Errors on QP Service Types

The different types of IB errors defined have varying effects on queue
processing dependent upon the QP's Service Type.

It is important to note that catastrophic errors on the local QP have no
direct effect on the remote QP. No attempt is made to send a message
below the Verbs to tear down a connection just because a QP has
encountered an error. However, NAK codes which are generated as the
result of a QP being in the error state will have an effect on the QP receiving
those NAK codes.

C10-142: Completion errors on a Send Queue shall result in Send Queue
processing being halted and the Send Queue state shall transition to the
Send Queue Error State, as per the state diagram. The Work Request
where the error occurred shall be completed in error.

In the case of local send queue errors, any and all Work Requests on
the Send Queue in which the error occurred are completed in error by
the Channel Interface. If the local error was an interface check, the
remote, corresponding Receive Queue will not consume a Work
Request and thus will not surface a completion error. If the local error was
a processing error, the remote, corresponding Receive Queue may or
may not complete a Work Request in error. The condition of the local
and remote memory when a completion error occurs on the send
queue for RDMA and atomic operations is specified in 10.3.1.6 Send
Queue Error (SQEr).

C10-143: For local Receive Queue completion errors, the Work Request
on the Receive Queue in which the error occurred shall be completed in
error by the CI. The QP shall be placed in the Error State. All subsequent
Work Requests shall be completed in error.

C10-144: Affiliated Asynchronous Errors shall result in the QP processing
being halted such that outstanding Work Requests are not completed
successfully by the Channel Interface. The QP shall be transitioned to the
Error State. Any request in progress on the corresponding Work Queue
shall be halted and shall not be completed successfully.

C10-145: Table 70 Completion Error Handling for RC Send Queues and
Table 71 Completion Error Handling for RC Receive Queues are a more
detailed description of the RC error handling actions that must be
supported by the CI according to the error and Work Queue type.

Descriptions of the error types used in the table are contained in 11.6.2
Completion Return Status on page 511.

Table 70 Completion Error Handling for RC Send Queues

Error Type                       | Completion Type | Effect on Local QP State | Effect on Remote QP State
---------------------------------+-----------------+--------------------------+--------------------------
Local Length                     | Interface       | SQ Error                 | None
Local Operation                  | Interface       | SQ Error                 | None
Local Operation                  | Processing      | SQ Error                 | None
Local Protection                 | Interface       | SQ Error                 | None
Local Protection                 | Processing      | SQ Error                 | None
Memory Window Bind               | Interface       | SQ Error                 | None
Invalid Request                  | Processing      | SQ Error                 | Error
Remote Access                    | Processing      | SQ Error                 | None
Remote Operation                 | Processing      | SQ Error                 | Error
RNR NAK Retry Counter Exceeded   | Processing      | SQ Error                 | None
Transport Retry Counter Exceeded | Processing      | SQ Error                 | None

Table 71 Completion Error Handling for RC Receive Queues

Error Type       | Completion Type | Effect on local QP state | Effect on remote QP state
-----------------+-----------------+--------------------------+--------------------------
Local Length     | Processing      | Error                    | Error when NAK received
Local Protection | Processing      | Error                    | Error when NAK received
Local Operation  | Processing      | Error                    | Error when NAK received



10.10.3.2 Reliable Datagram QPs:

O10-60: If the CI supports RD Service, immediate errors shall have no
effect on QP/EE processing since the Work Request never gets posted to
the QP/EE.

O10-61: If the CI supports RD Service, completion errors on a Send
Queue shall result in Send Queue processing being halted and the Send
Queue state shall transition to the Send Queue Error State, as per the
state diagram. The Work Request where the error occurred shall be
completed in error.

In the case of local send queue errors, any and all Work Requests on
the Send Queue in which the error occurred are completed in error by
the Channel Interface. If the local error was an interface check, the
remote, corresponding Receive Queue will not consume a Work
Request and thus will not surface a completion error. If the local error was
a processing error, the remote, corresponding Receive Queue may or
may not complete a Work Request in error. The condition of the local
and remote memory when a completion error occurs on the send
queue for RDMA and atomic operations is specified in 10.3.1.6 Send
Queue Error (SQEr).

O10-62: If the CI supports RD Service, for local Receive Queue
completion errors, the Work Request on the Receive Queue in which the error
occurred shall be completed in error by the CI. All subsequent Work
Requests shall not be affected by the error.

O10-63: If the CI supports RD Service, completion errors shall have no
effect on the EE Context State.

O10-64: If the CI supports RD Service, Affiliated Asynchronous Errors
shall result in the QP processing being halted such that outstanding Work
Requests are not completed successfully by the Channel Interface. The
QP shall transition to the Error State. Any request in progress on the
corresponding Work Queue shall be halted and shall not be completed
successfully.

O10-65: If the CI supports RD Service, when an Affiliated Asynchronous
Error is associated only with the QP, the error shall have no effect on the
EE context. If an Affiliated Error is associated with the EE context, the EE
context shall transition to the Error state.

O10-66: If the CI supports RD Service, Table 72 Completion Error
Handling for RD Send Queues and Table 73 Completion Error Handling for RD
Receive Queues are a more detailed description of the RD error handling
actions that must be supported by the CI for RD:

Table 72 Completion Error Handling for RD Send Queues

Error Type                               | Completion Type | Effect on Local QP State | Effect on Remote QP State | Effect on Local EE State | Effect on Remote EE State | Error Handling Action
-----------------------------------------+-----------------+--------------------------+---------------------------+--------------------------+---------------------------+----------------------
Local Length                             | Interface       | SQ Error                 | None                      | None                     | None                      | 1
Local Operation - QP                     | Interface       | SQ Error                 | None                      | None                     | None                      | 1
Local Operation - QP                     | Processing      | SQ Error                 | Rcv WC Err                | None                     | None                      | 1, 3, 6
Local Operation - EE                     | Processing      | SQ Error                 | Indeterminate             | EE Error                 | None                      | 1, 5
Local Protection                         | Interface       | SQ Error                 | None                      | None                     | None                      | 1
Local Protection                         | Processing      | SQ Error                 | Rcv WC Err                | None                     | None                      | 1, 3, 6
Remote Operation - QP                    | Processing      | SQ Error                 | Error                     | None                     | None                      | 1
Remote Operation - EE                    | Processing      | SQ Error                 | Error                     | Error                    | Error                     | 1, 3, 5
Memory Window Bind                       | Interface       | SQ Error                 | None                      | None                     | None                      | 1
Remote Access                            | Processing      | SQ Error                 | None                      | None                     | None                      | 1
Remote Operation - QP                    | Processing      | SQ Error                 | Error                     | None                     | None                      | 1, 3
Remote Operation - EE                    | Processing      | SQ Error                 | Rcv WC Err                | EE Error                 | EE Error                  | 1, 3, 5
Remote Invalid Request                   | Processing      | SQ Error                 | (a)                       | None                     | None                      | 1, 3
Local RDD Violation                      | Processing      | SQ Error                 | None                      | None                     | None                      | 1
Remote Invalid RD Request                | Processing      | SQ Error                 | (a)                       | None                     | None                      | 1, 3
Transport Timeout Retry Counter Exceeded | Processing      | SQ Error                 | None                      | Error                    | None                      | 1
RNR NAK Retry Counter Exceeded           | Processing      | SQ Error                 | None                      | None                     | None                      | 1

(a) None if 1st packet. Opt Rcv WC Err if other than 1st packet.

Table 73 Completion Error Handling for RD Receive Queues

Error Type                | Completion Type | Effect on local QP state | Effect on remote QP state | Effect on local EE state | Effect on remote EE state | Error Handling Action
--------------------------+-----------------+--------------------------+---------------------------+--------------------------+---------------------------+----------------------
Local Length              | Processing      | Rcv WC Err               | SQ Error                  | None                     | None                      | 1, 3
Local Operation - QP      | Processing      | Rcv WC Err               | SQ Error                  | None                     | None                      | 1, 3
Local Operation - EE      | Processing      | Rcv WC Err               | SQ Error                  | EE Error                 | EE Error                  | 1, 3, 5
Local Protection          | Processing      | Rcv WC Err               | SQ Error                  | None                     | None                      | 1, 3
Remote Invalid Request    | Processing      | (a)                      | SQ Error                  | None                     | None                      | 1, 3
Remote Invalid RD Request | Processing      | (a)                      | QP Error                  | None                     | None                      | 2

(a) None if 1st packet. Opt Rcv WC Err if other than 1st packet.



Error Handling Actions:

Uninvolved SQs and RQs are unaffected unless they attempt to use an
EE-context or QP that is in the error state.

1) The SQ active over the EE-context at the time the error occurred
goes to the SQEr state.

• Receives for the RQ associated with the local SQ placed in SQEr
state continue as normal (i.e., are not completed in error, unless
they also experience a separate error).

• Remainder of the WQEs in the SQ which experienced the error
are returned in error via Work Completions (WCs).

2) The SQ active over the EE-context at the time the error occurred
causes the full QP (associated with the SQ) to be placed in the error
state.

• Remainder of sends for the SQ that caused the error are returned
in error via WCs.

• Any receives associated with the local RQ are returned in error
via WCs.

3) RQ active over the EE-context at the time the error occurred has the
WQE which experienced the error returned in error via a WC.

• All other WQEs in that RQ continue as normal (i.e., are not
completed in error, unless they also experience an error).

• Sends for the SQ associated with the local RQ placed in SQEr
state continue as normal (i.e., are not completed in error, unless
they also experience a separate error).

4) RQ active over the EE-context at the time the error occurred causes
the full QP (associated with the RQ) to be placed in the error state.

• Remainder of receives for the RQ that caused the error are
returned in error via WCs.

• Any sends associated with the local SQ are returned in error via
WCs.

5) Local EE-context is placed in the error state. Remote EE-context
state is indeterminate (i.e., may not be in the error state, for example
as a result of a source timeout).

• EE-context cannot be resumed.

• Must re-establish EE-context.

6) When a Local Protection or Operation SQ Error occurs on RD QPs,
the CI on the local side shall emit an InfiniBand no-op (RDMA Write of
length 0 with no immediate data) below the Verbs to the RD RQ
associated with the local error, assuming the RD channel is still
operational. This will cause the in-process RQ Work Request on the
remote side to be completed in error. The Receive side cannot
depend on receiving that message.

10.10.3.3 Unreliable Connected QPs:

C10-146: Immediate errors shall have no effect on QP processing since
the Work Request never gets posted to the QP.

C10-147: Completion errors on a Send Queue shall result in Send Queue
processing being halted and the Send Queue state shall transition to the
Send Queue Error State, as per the state diagram. The Work Request
where the error occurred shall be completed in error.

In the case of local send queue errors, any and all Work Requests on
the Send Queue in which the error occurred are completed in error by
the Channel Interface. The remote, corresponding Receive Queue will
not consume a Work Request and thus will not surface a completion
error. The condition of the remote memory when a completion error
occurs on the send queue for RDMA Write operations is specified in
10.3.1.6 Send Queue Error (SQEr).

C10-148: For local Receive Queue completion errors, the Work Request
on the Receive Queue in which the error occurred shall be completed in
error by the CI. The QP is placed in the Error State. All subsequent Work
Requests shall be completed in error.

C10-149: Table 74 Completion Error Handling for UC Send Queues and
Table 75 Completion Error Handling for UC Receive Queues are a more
detailed description of the UC error handling actions that must be
supported by the CI according to the error and Work Queue type.

Descriptions of the error types used in the table are contained in 11.6.2
Completion Return Status on page 511.

Table 74 Completion Error Handling for UC Send Queues

Error Type         | Completion Type | Effect on Local QP state | Effect on Remote QP state
-------------------+-----------------+--------------------------+--------------------------
Local Length       | Interface       | SQ Error                 | None
Local Operation    | Interface       | SQ Error                 | None
Local Operation    | Processing      | SQ Error                 | None
Local Protection   | Interface       | SQ Error                 | None
Local Protection   | Processing      | SQ Error                 | None
Memory Window Bind | Interface       | SQ Error                 | None
Invalid Request    | Processing      | SQ Error                 | Error

Table 75 Completion Error Handling for UC Receive Queues

Error Type       | Completion Type | Effect on local QP state | Effect on remote QP state
-----------------+-----------------+--------------------------+--------------------------
Local Length     | Processing      | Error                    | None
Local Protection | Processing      | Error                    | None
Local Operation  | Processing      | Error                    | None





C10-150: Affiliated Asynchronous Errors shall result in the QP processing 
being halted such that outstanding Work Requests are not completed suc- 
cessfully by the Channel Interface. The QP shall transition to the Error 
State. Any request in progress on the corresponding Work Queue shall 
be halted and shall not be completed successfully. 



10.10.3.4 Unreliable Datagram QPs: 



C10-151: Immediate errors shall have no effect on QP processing since 
the Work Request never gets posted to the QP. 

C10-152: Completion errors on a Send Queue shall result in Send Queue 
processing being halted and the Send Queue state shall transition to the 
Send Queue Error State, as per the state diagram. The Work Request 
where the error occurred shall be completed in error. 

In the case of local send queue errors, any and all Work Requests on 
the Send Queue in which the error occurred are completed in error by 
the Channel Interface. The remote, corresponding Receive Queue will 
not consume a Work Request and thus will not surface a completion 
error. 



C10-153: For local Receive Queue completion errors, the Work Request
on the Receive Queue in which the error occurred shall be completed in
error by the CI. The QP is placed in the Error State. All subsequent Work
Requests shall be completed in error.

C10-154: Table 76 Completion Error Handling for UD Send Queues and
Table 77 Completion Error Handling for UD Receive Queues provide a
more detailed description of the UD error handling actions that must be
supported by the CI according to the error and Work Queue type.

Descriptions of the error types used in the table are contained in 11.6.2
Completion Return Status on page 511.

Table 76 Completion Error Handling for UD Send Queues

Error Type       | Completion Type | Effect on Local QP state | Effect on Remote QP state
-----------------+-----------------+--------------------------+--------------------------
Local Length     | Interface       | SQ Error                 | None
Local Operation  | Interface       | SQ Error                 | None
Local Operation  | Processing      | SQ Error                 | None
Local Protection | Interface       | SQ Error                 | None
Local Protection | Processing      | SQ Error                 | None
Invalid Request  | Processing      | SQ Error                 | Error

Table 77 Completion Error Handling for UD Receive Queues

Error Type       | Completion Type | Effect on local QP state | Effect on remote QP state
-----------------+-----------------+--------------------------+--------------------------
Local Length     | Processing      | Error                    | None
Local Protection | Processing      | Error                    | None
Local Operation  | Processing      | Error                    | None





C10-155: Affiliated Asynchronous Errors shall result in the QP processing
being halted such that outstanding Work Requests are not completed
successfully by the Channel Interface. The QP shall transition to the Error
State. Any request in progress on the corresponding Work Queue shall
be halted and shall not be completed successfully.

10.10.3.5 Raw QPs:

C10-156: Immediate errors shall have no effect on QP processing since
the Work Request never gets posted to the QP.

C10-157: Completion errors on a Send Queue shall result in Send Queue
processing being halted and the Send Queue state shall transition to the
Send Queue Error State, as per the state diagram. The Work Request
where the error occurred shall be completed in error.

In the case of local send queue errors, any and all Work Requests on
the Send Queue in which the error occurred are completed in error by
the Channel Interface. The remote, corresponding Receive Queue will
not consume a Work Request and thus will not surface a completion
error.

C10-158: For local Receive Queue completion errors, the Work Request
on the Receive Queue in which the error occurred shall be completed in
error by the CI. The QP is placed in the Error State. All subsequent Work
Requests shall be completed in error.
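
The general rules for local Receive Queue completion errors stated for each service type in this section differ in one respect: for RD, later Work Requests are unaffected, while for the other service types the QP enters the Error State and later Work Requests complete in error. The sketch below summarizes only those general rules. It is an illustration with hypothetical names, and it does not encode the per-error rows of the tables, which can differ (for example, the Remote Invalid RD Request row of Table 73).

```python
from enum import Enum


class Service(Enum):
    RC = "Reliable Connection"
    RD = "Reliable Datagram"
    UC = "Unreliable Connection"
    UD = "Unreliable Datagram"
    RAW = "Raw Datagram"


def after_receive_completion_error(service: Service) -> dict:
    """General effect of a local Receive Queue completion error."""
    if service is Service.RD:
        # Only the failing Work Request is completed in error.
        return {"qp_to_error_state": False,
                "later_wrs_completed_in_error": False}
    # RC, UC, UD and Raw: the QP is placed in the Error State and all
    # subsequent Work Requests are completed in error.
    return {"qp_to_error_state": True,
            "later_wrs_completed_in_error": True}
```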

C10-159: Table 78 Completion Error Handling for Raw Datagram Send
Queues and Table 79 Completion Error Handling for Raw Datagram
Receive Queues provide a more detailed description of the Raw Datagram
error handling actions that must be supported by the CI according to the
error and Work Queue type.

Descriptions of the error types used in the table are contained in 11.6.2
Completion Return Status on page 511.

Table 78 Completion Error Handling for Raw Datagram Send Queues

Error Type       | Completion Type | Effect on Local QP State | Effect on Remote QP State
-----------------+-----------------+--------------------------+--------------------------
Local Length     | Interface       | SQ Error                 | None
Local Operation  | Interface       | SQ Error                 | None
Local Operation  | Processing      | SQ Error                 | None
Local Protection | Interface       | SQ Error                 | None
Local Protection | Processing      | SQ Error                 | None
Invalid Request  | Processing      | SQ Error                 | Error

Table 79 Completion Error Handling for Raw Datagram Receive Queues 




Error Type 


Confipletion 
Type 


Effect on local 
QP state 


Effect on remote QP 
state 






Local Length 


Processing 


Error 


None 






Local Protection 


Processing 


Error 


None 






Local Operation 


Processing 


Error 


None 
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CI 0-1 60: Affiliated Asynchronous Errors shall result in the QP processing 1 
being halted such that outstanding Work Requests are not completed sue- 2 
cessfully by the Channel Interface. The QP shall transition to the Error 3 
State. Any request in progress on the corresponding Work Queue shall ^ 
be halted and shall not be completed successfully. 
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11.1 Verbs Introduction and Overview 



Chapter 1 1 : Software Transport Verbs i 

2 

3 
4 
5 
6 

The Verbs described in this chapter provide an abstract definition of the 7 
functionality provided to a host by a host channel interface. Host CIs g 
which are compliant with this specification must exhibit the semantic be- g 
havior described by the Verbs. 

10 

Since the Verbs define the behavior of the host CI, they may influence the ^ ^ 
design of software constructs, such as application programming inter- 12 
faces (APIs), which provide access to the host CI. However, this specifi- 13 
cation explicitly does not define any such API. In particular, there is no 
requirement that an API used with a compliant host CI be semantically ^ ^ 



consistent with the Verbs. 

11.1.1 Verb Classes

11.1.1.1 Mandatory vs. Optional Verbs

Some Verbs are mandatory, and some are required only if an optional
feature is supported.

C11-1: A CI shall support all Verbs classified as mandatory in Table 80
Verb Classes.

C11-2: If a CI claims conformance to an optional feature, the CI shall
support all Verbs associated with that optional feature as indicated in
Table 80 Verb Classes.

11.1.1.2 Mandatory vs. Optional Verb Functionality

Some Verbs define functionality that applies only if certain optional
features are supported.

C11-3: If a CI supports a given Verb, the CI shall support all functionality
defined for that Verb that's not indicated as being optional.

C11-4: If a CI supports a given Verb and claims conformance to an
optional feature, the CI shall support all functionality defined for that Verb
that's associated with that optional feature.

11.1.1.3 Consumer Accessibility

Verb Consumers are the direct users of the Verbs, and are sub-divided
into two classes, Privileged and User-Level.

Privileged Consumers are typically those Consumers that operate at a
privilege level sufficient to access OS internal data structures directly, and
that have the responsibility to control access to the Channel Interface. All
Verbs are available for use by Privileged Consumers.

User-Level Consumers are those Consumers that must rely on another
agent, having a sufficiently high level of privilege, to manipulate OS data
structures. Only those Verbs specifically labeled as such are available for
use by User-Level Consumers.

Table 80 Verb Classes

Verb                                  Mandatory/Optional          Consumer
                                      Classification              Accessibility
------------------------------------  --------------------------  -------------------------
Open HCA                              Mandatory                   Privileged
Query HCA                             Mandatory                   Privileged
Modify HCA Attributes                 Access violation counters   Privileged
Close HCA                             Mandatory                   Privileged
Allocate Protection Domain            Mandatory                   Privileged
Deallocate Protection Domain          Mandatory                   Privileged
Allocate Reliable Datagram Domain     RD Service                  Privileged
Deallocate Reliable Datagram Domain   RD Service                  Privileged
Create Address Handle                 Mandatory                   User-Level and Privileged
Modify Address Handle                 Mandatory                   User-Level and Privileged
Query Address Handle                  Mandatory                   User-Level and Privileged
Destroy Address Handle                Mandatory                   User-Level and Privileged
Create Queue Pair                     Mandatory                   Privileged
Modify Queue Pair                     Mandatory                   Privileged
Query Queue Pair                      Mandatory                   Privileged
Destroy Queue Pair                    Mandatory                   Privileged
Get Special QP                        Mandatory                   Privileged
Create Completion Queue               Mandatory                   Privileged
Query Completion Queue                Mandatory                   Privileged
Resize Completion Queue               Mandatory                   Privileged
Destroy Completion Queue              Mandatory                   Privileged
Create EE Context                     RD Service                  Privileged
Modify EE Context Attributes          RD Service                  Privileged
Query EE Context                      RD Service                  Privileged
Destroy EE Context                    RD Service                  Privileged
Register Memory Region                Mandatory                   Privileged
Register Physical Memory Region       Mandatory                   Privileged
Query Memory Region                   Mandatory                   Privileged
Deregister Memory Region              Mandatory                   Privileged
Reregister Memory Region              Mandatory                   Privileged
Reregister Physical Memory Region     Mandatory                   Privileged
Register Shared Memory Region         Mandatory                   Privileged
Allocate Memory Window                Mandatory                   Privileged
Query Memory Window                   Mandatory                   Privileged
Bind Memory Window                    Mandatory                   User-Level and Privileged
Deallocate Memory Window              Mandatory                   Privileged
Attach QP to Multicast Group          UD Multicast Service        Privileged
Detach QP from Multicast Group        UD Multicast Service        Privileged
Post Send Request                     Mandatory                   User-Level and Privileged
Post Receive Request                  Mandatory                   User-Level and Privileged
Poll for Completion                   Mandatory                   User-Level and Privileged
Request Completion Notification       Mandatory                   User-Level and Privileged
Set Completion Event Handler          Mandatory                   Privileged
Set Asynchronous Event Handler        Mandatory                   Privileged
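
The two classifications in Table 80 can be read as two independent lookups: which Verbs a CI must support (C11-1, C11-2) and which Consumers may invoke them. The following is a minimal sketch in Python, not part of the specification, which defines no API. The names `VERB_CLASSES`, `must_support` and `may_call` are hypothetical, and only a handful of rows from the table are included.

```python
from enum import Enum


class Consumer(Enum):
    PRIVILEGED = "privileged"
    USER_LEVEL = "user-level"


# A few rows of Table 80: verb -> (classification, available to User-Level?).
# The dictionary layout is an illustrative assumption, not a spec structure.
VERB_CLASSES = {
    "Open HCA": ("Mandatory", False),
    "Create Address Handle": ("Mandatory", True),
    "Create Queue Pair": ("Mandatory", False),
    "Create EE Context": ("RD Service", False),
    "Bind Memory Window": ("Mandatory", True),
    "Attach QP to Multicast Group": ("UD Multicast Service", False),
    "Post Send Request": ("Mandatory", True),
    "Poll for Completion": ("Mandatory", True),
}


def must_support(verb, claimed_features):
    """C11-1 / C11-2: mandatory Verbs always, optional ones if the feature is claimed."""
    classification, _ = VERB_CLASSES[verb]
    return classification == "Mandatory" or classification in claimed_features


def may_call(verb, consumer):
    """All Verbs are open to Privileged Consumers; User-Level only where labeled."""
    _, user_level_ok = VERB_CLASSES[verb]
    return consumer is Consumer.PRIVILEGED or user_level_ok
```

For example, `must_support("Create EE Context", {"RD Service"})` holds only for a CI that claims the RD Service feature, while `may_call("Post Send Request", Consumer.USER_LEVEL)` holds for every CI.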
11.2 Transport Resource Management

11.2.1 HCA

11.2.1.1 Open HCA

Description:

Opens the specified HCA and returns an opaque object or handle to
uniquely reference each HCA so that Consumers can distinguish
between HCAs in the endnode.

C11-5: The handles returned for different HCAs within a system shall all
be unique.

Once opened, a specific HCA cannot be opened again until after it is
closed. Opening the HCA prepares the HCA for use by the Consumer.

C11-6: If Open HCA is called for an HCA that is currently open, the CI
shall return the HCA already in use error.

Input Modifiers:

• The unique identifier for this HCA. The naming scheme is defined
  by the OSV.

Output Modifiers:

• A handle for the HCA instance used as a modifier to other Verbs
  to specify the desired target HCA.

• Verb Results:

  • Operation completed successfully.
  • Insufficient resources to complete request.
  • Invalid HCA name.
  • HCA already in use.
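
The Open HCA rules (unique handles per C11-5, and the HCA already in use error per C11-6) can be illustrated with a small model. This is a hedged sketch only: the class `ChannelInterface`, its methods and the exception names are hypothetical, integer handles are an arbitrary choice for an "opaque object", and the insufficient-resources result is not modeled.

```python
import itertools


class VerbError(Exception):
    """Base class for the hypothetical error results in this sketch."""


class InvalidHcaName(VerbError):
    pass


class HcaAlreadyInUse(VerbError):
    pass


class InvalidHcaHandle(VerbError):
    pass


class ChannelInterface:
    def __init__(self, hca_names):
        self._known = set(hca_names)      # names assigned by the OSV
        self._open = {}                   # handle -> HCA name
        self._ids = itertools.count(1)    # never reused, so handles stay unique

    def open_hca(self, name):
        if name not in self._known:
            raise InvalidHcaName(name)
        if name in self._open.values():   # C11-6: already open
            raise HcaAlreadyInUse(name)
        handle = next(self._ids)
        self._open[handle] = name
        return handle

    def close_hca(self, handle):
        if handle not in self._open:
            raise InvalidHcaHandle(handle)
        del self._open[handle]
```

Opening `"hca0"` twice without an intervening `close_hca` raises `HcaAlreadyInUse`; after closing, the same HCA can be opened again and receives a fresh handle.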

11.2.1.2 Query HCA

Description:

Returns the attributes for the specified HCA.

The maximum values defined in this section are guaranteed
not-to-exceed values. It is possible for an implementation to allocate some HCA
resources from the same space. In that case, the maximum values
returned are not guaranteed for all of those resources simultaneously.

Input Modifiers:

• HCA handle.

Output Modifiers:

• The HCA attributes returned are:

  • Vendor specific information such as:

    • Vendor ID.
    • Vendor supplied Part ID.
    • Hardware version.

  • HCA specific values:

    • The maximum number of QPs supported by this HCA.

    • The maximum number of outstanding work requests on any
      Work Queue supported by this HCA.

    • Ability of this HCA to support resizing of the number of
      outstanding WQEs on the work queues of a QP.

    • The maximum number of scatter/gather entries per Work
      Request supported by this HCA, for all Work Requests other
      than Reliable Datagram Receive Queue Work Requests.

    • The maximum number of scatter/gather entries per Reliable
      Datagram Receive Queue Work Request supported by this
      HCA. Zero if RD Service is not supported.

    • The maximum number of CQs supported by this HCA.

    • The maximum number of entries in each CQ supported by
      this HCA.

    • The maximum number of Memory Regions supported by this
      HCA.

    • The largest contiguous block that can be registered by this
      HCA, specified in bytes.

    • The maximum number of Protection Domains supported by
      this HCA.

    • The memory page sizes supported by this HCA.

    • The maximum number of virtual lanes supported by this HCA.

    • Number of physical ports on this HCA.

    • Maximum number of partitions supported by this HCA. The
      number of partitions supported must be at least one.

    • MTU and message size supported by this HCA, on a per port
      basis.

    • Base LID & LMC for each port of this HCA. These values are
      valid only when the Port State of the port is Armed or Active.
      For other port states the values returned are indeterminate.
      (For more information on the port state see 14.4.4 Port State
      Transitions on page 665.)
    • Contents and length of the Source GID Table. The values of
      assigned GIDs are valid only when Port State is Armed or
      Active. For other states the value of assigned GIDs is
      indeterminate.

    • Link state of each port of this HCA (see 7.2.7 State Machine
      Terms on page 137).

    • Contents and length of the partition table. A partition table is
      required per port. The contents of the partition table are valid
      only when the Port State is Armed or Active. For other states
      the contents of the partition table are implementation
      dependent.

    • Bad P_Key counter support indicator.

    • Optional Bad P_Key counter for each port supported by the
      HCA.

    • Q_Key Violation counter support indicator.

    • Q_Key Violation counter for each port supported by the HCA.

    • Contents of the Subnet Manager address information. This is
      a table, with entries arranged on a per HCA port basis, which
      contains the LID and Service Level of the Subnet Manager for
      that port. If this has not been set by the Subnet Manager
      (Port State is Armed or Active), this should be set to the
      permissive LID (0xFFFF).

    • The maximum number of RDMA Reads & atomic operations
      that can be outstanding per QP with this HCA as the target.
      Shall apply to atomics only if this HCA supports atomic
      operations.

    • The maximum number of RDMA Reads & atomic operations
      that can be outstanding per EE with this HCA as the target.
      Shall apply to atomics only if this HCA supports RD & atomic
      operations. For this version of the specification, this value is
      one.

    • The maximum number of resources used for RDMA Reads &
      atomic operations by this HCA with this HCA as the target.
      Shall apply to atomics only if this HCA supports atomic
      operations.

    • The maximum depth per QP for initiation of RDMA Read &
      atomic operations by this HCA. Shall apply to atomics only if
      this HCA supports atomic operations.

    • The maximum depth per EE for initiation of RDMA Read &
      atomic operations by this HCA. Shall apply to atomics only if
      this HCA supports RD & atomic operations. For this version
      of the specification, this value is one.
    • Maximum allowable number of entries of the Scatter/Gather
      List for Reliable Datagram Receive Queue Work Requests.

    • Ability of this HCA to support atomic operations as well as
      serialization of atomic operations between itself and other
      system components such as processors and other HCAs. Three
      levels of atomicity are defined for this version of the
      specification:

      • Atomic operations not supported.

      • Atomicity is guaranteed only between QPs on this HCA.

      • Atomicity is guaranteed between this HCA and any other
        component, such as CPUs and other HCAs.

    • The maximum number of EE contexts that can be supported
      by this HCA. Shall be zero if the HCA does not support
      Reliable Datagrams.

    • Maximum number of RDDs supported by this HCA. The
      number of RDDs supported must be at least two. Shall be zero if
      the HCA does not support Reliable Datagrams.

    • The maximum number of Memory Windows supported by this
      HCA.

    • The maximum number of Raw IPv6 Datagram QPs supported
      by this HCA. Shall be zero if Raw IPv6 Datagrams are not
      supported.

    • The maximum number of Raw Ethertype Datagram QPs
      supported by this HCA. Shall be zero if Raw Ethertype
      Datagrams are not supported.

    • Ability of this HCA to support modifying the maximum number
      of outstanding Work Requests per QP.

    • Maximum number of multicast groups supported by this HCA.
      Shall be zero if this HCA does not support IBA unreliable
      multicast.

    • Maximum number of QPs which can be attached to multicast
      groups for this HCA. Shall be zero if this HCA does not
      support IBA unreliable multicast.

    • Maximum number of QPs per multicast group supported by
      this HCA. Shall be zero if this HCA does not support IBA
      unreliable multicast.

    • Ability of this HCA to support raw packet multicast.

    • Ability of this HCA to support automatic path migration.

    • Maximum number of Address Handles supported by this
      HCA.

• Verb Results:

  • Operation completed successfully.
  • Insufficient resources to complete this request.
  • Invalid HCA handle.
11.2.1.3 Modify HCA Attributes

Description:

Modifies the optional key counters in the HCA. Shall apply only if the
HCA supports the Bad P_Key counter or the Invalid Q_Key counter.

Input Modifiers:

• HCA handle.

• Counter to modify. Valid Counters are the Bad P_Key counter
  and the Invalid Q_Key counter.

• Value of key counter.

Output Modifiers:

• Verb Results:

  • Operation completed successfully.
  • Invalid HCA handle.
  • Invalid Counter specified.
  • Invalid Counter value.

11.2.1.4 Close HCA

Description:

Closes and resets the specified HCA. This Verb is responsible only for
deallocating resources allocated by the Channel Interface to prepare
the HCA for use by the Consumer. All other resources are no longer
associated or connected with the CI and are the responsibility of the
Consumer to handle as deemed necessary.

Input Modifiers:

• HCA handle.

Output Modifiers:

• Verb Results:

  • Operation completed successfully.
  • Invalid HCA handle.

11.2.1.5 Allocate Protection Domain

Description:

Allocates an unused Protection Domain object. Protection Domain
objects are required when creating a Queue Pair or Address Handle,
registering memory and allocating memory windows. A Protection
Domain object provides an association between Queue Pairs, Address
Handles, Memory Regions and Memory Windows. Operations on a
Queue Pair that cause access to a Memory Region or a Memory
Window are allowed only when the Protection Domain of the Queue
Pair and the Protection Domain of the Memory Region or Memory
Window are identical.

Operations on an unreliable datagram queue pair are allowed only
when the Protection Domain of the Queue Pair and the Protection
Domain of the Address Handle contained in the work request are
identical.

Input Modifiers:

• HCA Handle.

Output Modifiers:

• Protection Domain Object.

• Verb Results:

  • Operation completed successfully.
  • Insufficient resources to complete request.
  • Invalid HCA handle.
11.2.1.6 Deallocate Protection Domain

Description:

Returns a previously allocated Protection Domain object for reuse by
the Allocate Protection Domain Verb. The Protection Domain object
cannot be deallocated if it is still associated with any Queue Pair,
Memory Region or Memory Window, or Address Handle.

Input Modifiers:

• HCA Handle.

• Protection Domain object.

Output Modifiers:

• Verb Results:

  • Operation completed successfully.
  • Invalid Protection Domain.
  • Protection Domain is in use.
  • Invalid HCA handle.
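
The two Protection Domain rules described for the Allocate and Deallocate Verbs (access is allowed only when the domains are identical, and a domain cannot be deallocated while resources are still associated with it) can be modeled in a few lines. This is an illustrative sketch under assumed names (`ProtectionDomain`, `Resource`, `access_allowed`, `deallocate_pd`), none of which come from the specification.

```python
class ProtectionDomainInUse(Exception):
    """Mirrors the 'Protection Domain is in use' Verb result."""


class ProtectionDomain:
    def __init__(self):
        # QPs, Address Handles, Memory Regions and Memory Windows
        # currently associated with this domain.
        self.members = set()


class Resource:
    """Hypothetical stand-in for any object created against a PD."""

    def __init__(self, kind, pd):
        self.kind = kind
        self.pd = pd
        pd.members.add(self)

    def destroy(self):
        self.pd.members.discard(self)


def access_allowed(queue_pair, target):
    """A QP may use a region, window or address handle only in its own PD."""
    return queue_pair.pd is target.pd


def deallocate_pd(pd):
    """Refuse to release a PD that still has associated resources."""
    if pd.members:
        raise ProtectionDomainInUse(f"{len(pd.members)} resource(s) still attached")
```

A Consumer would therefore destroy every QP, Memory Region, Memory Window and Address Handle in a domain before calling the deallocate step.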

11.2.1.7 Allocate Reliable Datagram Domain

Description:

Allocates an unused Reliable datagram domain object. Reliable
datagram domain objects are required when setting up a reliable datagram
Queue Pair and EE contexts. A reliable datagram domain object
provides an association between Queue Pairs and EE contexts.
Operations on a reliable datagram queue pair directed at an EE context are
allowed only when the reliable datagram domain of the queue pair and
the reliable datagram domain of the EE context are identical.

Input Modifiers:

• HCA Handle.

Output Modifiers:

• Reliable datagram domain object.

• Verb Results:

  • Operation completed successfully.
  • Insufficient resources to complete request.
  • Invalid HCA handle.
  • Reliable Datagrams not supported.

11.2.1.8 Deallocate Reliable Datagram Domain

Description:

Returns a previously allocated reliable datagram domain object for
reuse by the Allocate Reliable Datagram Domain Verb. The reliable
datagram domain object cannot be deallocated if it is still associated
with a Queue Pair or an EE context.

Input Modifiers:

• HCA Handle.

• Reliable datagram domain object.

Output Modifiers:

• Verb Results:

  • Operation completed successfully.
  • Invalid reliable datagram domain.
  • Reliable datagram domain is in use.
  • Invalid HCA handle.
  • Reliable Datagrams not supported.
11.2.2 Address Management Verbs

These Verbs create, manipulate and destroy address handles. These
address handles are only used for Work Requests submitted to Unreliable
Datagram Service Type QPs.

11.2.2.1 Create Address Handle

Description:

The purpose of the Create Address Handle Verb is to create an
address handle for the address vector passed in through the Verbs. The
normal completion for this Verb returns the address handle. The
address handle is used to reference a local or global destination in all UD
QP Post Sends.

Input Modifiers:

• HCA Handle.

• Protection domain.

• Address vector, for UD transports only, containing:

  • Service level.

  • Send Global Routing Header Flag.

  • Destination LID. If destination is in same subnet, LID = final
    destination; otherwise LID = router LID.

  • For global destination:

    • Flow label.
    • Hop limit.
    • Traffic class.
    • Source GID index.

  • For global destination or Multicast address:

    • Destination's GID (a.k.a. IPv6) address.

  • Maximum Static Rate.

  • Source Path Bits.

Output Modifiers:

• Address Handle.

• Verb results:

  • Operation completed successfully.
  • Invalid HCA handle.
  • Invalid protection domain.
  • Insufficient resources to complete request.
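
The address vector passed to the Address Handle Verbs groups a few always-present fields with a block that applies only to global destinations. The sketch below shows one way to hold that structure. All class and field names are hypothetical, deriving the Send Global Routing Header Flag from the presence of the global block is an assumption of this sketch rather than a rule in the text, and the 16-byte check reflects that a GID has the size of an IPv6 address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GlobalRoute:
    """Fields listed under 'For global destination' plus the destination GID."""
    flow_label: int
    hop_limit: int
    traffic_class: int
    source_gid_index: int
    destination_gid: bytes

    def __post_init__(self):
        if len(self.destination_gid) != 16:   # 128 bits, like an IPv6 address
            raise ValueError("GID must be 16 bytes")


@dataclass(frozen=True)
class AddressVector:
    service_level: int
    destination_lid: int
    max_static_rate: int
    source_path_bits: int
    global_route: Optional[GlobalRoute] = None

    @property
    def send_grh(self) -> bool:
        # Assumption: the flag is set exactly when global routing data exists.
        return self.global_route is not None


def select_destination_lid(final_lid: int, router_lid: int, same_subnet: bool) -> int:
    """Same subnet: LID of the final destination; otherwise the router's LID."""
    return final_lid if same_subnet else router_lid
```

For a destination on another subnet, the Consumer would fill `destination_lid` with `select_destination_lid(final, router, same_subnet=False)` and attach a `GlobalRoute`.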
11.2.2.2 Modify Address Handle

Description:

The purpose of the Modify Address Handle Verb is to change an
address vector associated with the address handle passed in by the
Consumer.

Input Modifiers:

• HCA Handle.

• Address Handle.

• Address vector, for UD transports only, containing:

  • Service level.

  • Send Global Routing Header Flag.

  • Destination LID. If destination is in same subnet, LID = final
    destination; otherwise LID = router LID.

  • For global destination:

    • Flow label.
    • Hop limit.
    • Traffic class.
    • Source GID index.

  • For global destination or Multicast address:

    • Destination's GID (a.k.a. IPv6) address.

  • Maximum Static Rate.

  • Source Path Bits.

Output Modifiers:

• Verb results:

  • Operation completed successfully.
  • Invalid HCA handle.
  • Invalid address handle.

11.2.2.3 Query Address Handle

Description:

The purpose of the Query Address Handle Verb is to obtain the
address vector associated with the address handle passed in by the
Consumer.

Input Modifiers:

• HCA Handle.

• Address Handle.

Output Modifiers:

• Address vector, for UD transports only, containing:

  • Service level.

  • Send Global Routing Header Flag.

  • Destination LID. If destination is in same subnet, LID = final
    destination; otherwise LID = router LID.

  • For global destination:

    • Flow label.
    • Hop limit.
    • Traffic class.
    • Source GID index.

  • For global destination or Multicast address:

    • Destination's GID (a.k.a. IPv6) address.

  • Maximum Static Rate.

  • Source Path Bits.

• Protection domain.

• Verb results:

  • Operation completed successfully.
  • Invalid HCA handle.
  • Invalid address handle.

11.2.2.4 Destroy Address Handle

Description:

The purpose of the Destroy Address Handle Verb is to remove an
address vector and its associated address handle from the CI. After the
address handle is removed, it can no longer be used to reference the
destination.

Input Modifiers:

• HCA Handle.

• Address Handle.

Output Modifiers:

• Verb results:

  • Operation completed successfully.
  • Invalid HCA handle.
  • Invalid address handle.
11.2.3 Queue Pair

11.2.3.1 Create Queue Pair

Description:

Creates a QP for the specified HCA.

A set of initial QP attributes must be specified by the Consumer.

C11-7: If any of the required initial attributes are illegal or missing, an
error shall be returned and the Queue Pair shall not be created.

On success, a handle to the newly created QP and the QP number are
returned.
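
The all-or-nothing behavior required by C11-7 amounts to validating every initial attribute before any state is created. The following is a minimal sketch of that ordering. The attribute names, the `Hca` class and the exception are hypothetical, only a subset of the creation attributes is checked, and starting QP numbers at 2 is an assumption made here so that the sketch does not hand out the special QP numbers QP0 and QP1.

```python
import itertools

# Hypothetical attribute names for a subset of the creation-time attributes.
REQUIRED_ATTRS = {
    "send_cq", "recv_cq", "max_send_wr", "max_recv_wr",
    "protection_domain", "service_type",
}
SERVICE_TYPES = {"RC", "RD", "UC", "UD"}


class InvalidAttributes(Exception):
    pass


class Hca:
    def __init__(self):
        self.qps = {}                          # handle -> (qp_number, attributes)
        self._numbers = itertools.count(2)     # assumption: skip QP0 and QP1

    def create_queue_pair(self, attrs):
        # Validate everything first so a failure leaves no partial QP behind.
        missing = REQUIRED_ATTRS - set(attrs)
        if missing:
            raise InvalidAttributes(f"missing: {sorted(missing)}")
        for key in ("max_send_wr", "max_recv_wr"):
            if not isinstance(attrs[key], int) or attrs[key] < 0:
                raise InvalidAttributes(f"illegal value for {key}")
        if attrs["service_type"] not in SERVICE_TYPES:
            raise InvalidAttributes("illegal service type")

        qp_number = next(self._numbers)
        handle = object()                      # opaque handle
        self.qps[handle] = (qp_number, dict(attrs))
        return handle, qp_number
```

Because the checks run before `self.qps` is touched, a rejected request leaves the HCA's QP table exactly as it was.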

Input Modifiers:

• HCA handle.

• The QP attributes that must be specified at QP create time are:

  • The CQ to be associated with the Send Queue.

  • The CQ to be associated with the Receive Queue.

  • The maximum number of outstanding Work Requests the
    Consumer expects to submit to the Send Queue.

  • The maximum number of outstanding Work Requests the
    Consumer expects to submit to the Receive Queue.

  • The maximum number of scatter/gather elements the
    Consumer will specify in a Work Request submitted to the Send
    Queue.

  • The maximum number of scatter/gather elements the
    Consumer will specify in a Work Request submitted to the
    Receive Queue.

  • Reliable datagram domain to be associated with this QP.
    Applicable only to RD QPs.

  • The Signaling Type must be specified for the Send Queue on
    this QP. The valid types are:

    • All Work Requests submitted to the Send Queue always
      generate a completion entry.

    • Consumer must specify on each Work Request submitted
      to the Send Queue whether to generate a completion
      entry for successful completions.

  • The Consumer must specify a Protection Domain.

  • The Transport Service Type requested for this QP. Valid
    Service Types are:
    • Reliable Connection.
    • Reliable Datagram.
    • Unreliable Connection.
    • Unreliable Datagram.

Output Modifiers:

• The handle for the newly created QP.

• QP number.

• The actual number of outstanding Work Requests supported on
  the Send Queue. If an error is not returned, this is guaranteed to
  be greater than or equal to the number requested. (This may
  require the Consumer to increase the size of the CQ.)

• The actual number of outstanding Work Requests supported on
  the Receive Queue. If an error is not returned, this is guaranteed
  to be greater than or equal to the number requested. (This may
  require the Consumer to increase the size of the CQ.)

• The actual number of scatter/gather elements that can be
  specified in Work Requests submitted to the Send Queue. If an error is
  not returned, this is guaranteed to be greater than or equal to the
  number requested.

• The actual number of scatter/gather elements that can be
  specified in Work Requests submitted to the Receive Queue. If an
  error is not returned, this is guaranteed to be greater than or equal
  to the number requested.

• Verb Results:

  • Operation completed successfully.
  • Insufficient resources to complete request.
  • Invalid HCA handle.
  • Invalid CQ handle.
  • Maximum number of Work Requests requested exceeds HCA
    capability.
  • Maximum number of scatter/gather elements requested
    exceeds HCA capability.
  • Invalid Protection Domain.
  • Invalid Service Type for this QP.
  • Invalid Reliable Datagram Domain.

11.2.3.2 Modify Queue Pair

Description:



C11-8: Upon invocation of this Verb, the CI shall modify the attributes for
the specified QP and then shall cause the QP to transition to the specified
QP state.

Only a subset of the QP attributes can be modified in each of the QP
states.

C11-9: If any of the QP attributes to be modified are invalid or the
requested state transition is invalid, none of the QP attributes shall be
modified. An immediate error shall be returned and the QP state shall remain
unchanged.

Some QP attributes can be modified with outstanding Work Requests.
WRs can be outstanding on the Receive Queue when the QP is in the
Init, RTR, RTS & SQEr states and on the Send Queue when the QP is
in the RTS state. Any outstanding Work Request on a Work Queue
may not execute as expected if the QP modifiers are changed. For
instance, if RDMA Reads, which were successfully posted, are
outstanding when the QP is modified to no longer allow RDMA Reads,
some outstanding in-flight RDMA Reads may complete while pending
WRs may fail.

C11-10: The properties and requirements of the QP state transitions shall
be supported as shown in Table 81.

Table 81: QP State Transition Properties

Transition: Reset to Init
  Required Attributes:
    • Enable/disable RDMA (a) and Atomic Operations.
    • P_Key index.
    • Physical port.
    • Q_Key for unconnected Service Types.
  Optional Attributes: None.
  Actions: Enable posting to the Receive Queue.

Transition: Init to RTR
  Required Attributes:
    • Remote Node Address Vector (Connected QPs only).
    • RQ PSN.
    • Number of responder resources for RDMA Read/atomic ops.
  Optional Attributes:
    • Alternate destination node address (RC/UC QPs only).
    • Enable/disable RDMA (a) and Atomic Operations.
    • P_Key index.
    • Q_Key.
    • Number of WQEs.
    • Minimum RNR NAK Timer Field (RC QPs only).
  Actions: Activate receive processing.

Transition: RTR to RTS
  Required Attributes:
    • Timeout (RC QP only).
    • Retry count (RC QP only).
    • RNR retry count (RC QP only).
    • SQ PSN.
    • Number of Outstanding RDMA Read/atomic ops at destination.
  Optional Attributes:
    • Enable/disable RDMA (a) & Atomic Operations.
    • Q_Key.
    • Alternate destination node address (RC/UC QPs only).
    • Path migration state.
    • Number of WQEs.
    • Minimum RNR NAK Timer Field (RC QPs only).
  Actions: Activate send processing.

Transition: RTS to RTS (no transition)
  Required Attributes: None.
  Optional Attributes:
    • Enable/disable RDMA Operations (a).
    • Q_Key.
    • Alternate destination node address (RC/UC QPs only).
    • Path Migration state.
    • Number of WQEs.
    • Minimum RNR NAK Timer Field (RC QPs only).
  Actions: No transition.

Transition: SQEr to RTS
  Required Attributes: None.
  Optional Attributes:
    • Enable/disable RDMA & Atomic Operations (a).
    • Q_Key.
    • Number of WQEs.
    • Minimum RNR NAK Timer Field (RC QPs only).
  Actions: Activate send processing.

Transition: RTS to SQD
  Required Attributes: None.
  Optional Attributes: None.
  Actions: Deactivate send processing.

Transition: SQD to RTS
  Required Attributes: None.
  Optional Attributes:
    • Remote Node Address Vector (Connected QPs only).
    • Alternate destination node address (RC/UC QPs only).
    • Channel migration state.
    • Number of Outstanding RDMA Read/atomic ops at destination.
    • Number of local RDMA Read/atomic responder resources.
    • Q_Key.
    • Timeout/Retry Information.
    • Number of WQEs.
    • Minimum RNR NAK Timer Field (RC QPs only).
  Actions: Activate send processing.

Transition: Any State to Error
  Required Attributes: None.
  Optional Attributes: None allowed.
  Actions: Queue processing is stopped. Work Requests pending or
    in process are completed in error, when possible.

Transition: Any State to Reset
  Required Attributes: None.
  Optional Attributes: None allowed.
  Actions: QP attributes are reset to the same values as after the
    QP was created. Outstanding Work Requests are removed from the
    queues without notifying the Consumer.

a. If disable RDMA is requested while incoming RDMAs to that queue are in process, it is
indeterminate when the disable will take effect. It is up to the Consumer to coordinate the disable
with the remote QPs.
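
Table 81 together with C11-9 describes a small state machine: a request is checked against the permitted transitions and required attributes, and only if everything is valid are the attributes applied and the state changed. The sketch below illustrates that ordering. It is a simplification under stated assumptions: the attribute names are hypothetical, only attributes that are not service-type specific are listed as required, optional attributes are not checked per transition, and clearing the attribute dictionary stands in for "reset to the values after creation".

```python
# Transitions listed in Table 81, besides "any state to Error/Reset".
ALLOWED = {
    ("Reset", "Init"), ("Init", "RTR"), ("RTR", "RTS"), ("RTS", "RTS"),
    ("SQEr", "RTS"), ("RTS", "SQD"), ("SQD", "RTS"),
}

# Hypothetical names for a subset of the required attributes.
REQUIRED = {
    ("Reset", "Init"): {"rdma_atomic_enable", "p_key_index", "physical_port"},
    ("Init", "RTR"): {"rq_psn", "responder_resources"},
    ("RTR", "RTS"): {"sq_psn", "outstanding_rdma_atomic"},
}


class ModifyError(Exception):
    pass


class QueuePair:
    def __init__(self):
        self.state = "Reset"
        self.attrs = {}


def modify_qp(qp, next_state, new_attrs):
    to_any = next_state in ("Error", "Reset")
    if not to_any and (qp.state, next_state) not in ALLOWED:
        raise ModifyError("Invalid QP state")
    if to_any and new_attrs:
        raise ModifyError("Cannot change QP attribute")   # "None allowed"
    missing = REQUIRED.get((qp.state, next_state), set()) - set(new_attrs)
    if missing:
        raise ModifyError(f"missing required attributes: {sorted(missing)}")

    # All checks passed: apply the attributes, then transition (C11-8).
    if next_state == "Reset":
        qp.attrs.clear()
    else:
        qp.attrs.update(new_attrs)
    qp.state = next_state
```

Since every check precedes the first mutation, a rejected call leaves both `qp.attrs` and `qp.state` untouched, which is the behavior C11-9 asks for.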



Input Modifiers:

• HCA handle.

• QP handle.

• The QP attributes to modify and their new values. The QP
  attributes that can be modified after the QP has been created are:

  • Next QP state. If the current state is specified, only the QP
    attributes will be modified.

  • Enable or disable Send Queue Drained, Asynchronous
    Affiliated Event Notification. This modifier is only
    applicable when the next QP state chosen is SQD.

  • Primary P_Key index. Not applicable on Raw Datagram or
    Reliable Datagram QPs.

  • The Q_Key that incoming Datagram messages are checked
    against and possibly used as the outgoing Q_Key (based on
    the WR Q_Key). This applies only to UD & RD QPs.

  • PSN for Send QP. Applicable only for RC, UD & UC QPs.

  • The maximum number of outstanding Work Requests the
    Consumer expects to submit to the Send Queue, if resizing of
    the work queues is supported by the HCA.

  • The maximum number of outstanding Work Requests the
    Consumer expects to submit to the Receive Queue, if resizing
    of the work queues is supported by the HCA.

  The following attributes are not applicable if the QP specified is a
  Special QP: SMI QP (QP0), GSI QP (QP1), Raw IPv6 and Raw
  Ethertype.

  • Primary physical port associated with this QP. Not applicable
    on RD QPs.

  • PSN for Receive QP. Applicable only for RC & UC QPs.

  • Enable or disable incoming RDMA Reads on this QP. Not
    applicable on Unreliable Service Type QPs.

  • Enable or disable incoming RDMA Writes on this QP. Not
    applicable on UD Service Type QPs.

  • Enable or disable incoming Atomic Operations on this QP. Not
    applicable on Unreliable Service Type QPs.

  • Number of RDMA Reads & atomic operations outstanding at
    any time. Applicable only to RC QPs.

  • Number of responder resources for handling incoming RDMA
    Reads & atomic operations. This value may be rounded up to
    a supported value, not to exceed the maximum value
    allowable for QPs for this HCA. Applicable only to RC QPs.

  • Destination QP number. Applicable only to RC & UC QPs.

  • Minimum RNR NAK Timer Field Value. When a message
    arrives which is targeted at a local receive queue, and that
    receive queue has no receive work requests outstanding, the CI
    may respond to the initiator with an RNR NAK packet. This
    modifier is the minimum value which shall be sent in the Timer
    Field of such an RNR NAK packet; it does not affect RNR
    NAKs sent for other reasons. If the value specified is not one
    of the RNR NAK Timer Field values defined in Table 45
    Encoding for RNR NAK Timer Field on page 283, the CI shall
    return an immediate error. Applicable only to RC QPs.

Address vector, for RC & UC transports only, containing: 

Service level. 

Send Global Routing Header Flag. 

Destination LID. If destination is in same subnet, LID = 
final destination; otherwise LID = router LID. 

Path MTU. 

Maximum Static Rate. 
Timeout. Applicable only to RC QPs. 
Retry count. Applicable only to RC QPs. 
RNR retry count. Applicable only to RC QPs. 
Source Path Bits. 
For global destination: 
Traffic class. 
Flow label. 
Hop limit. 
Source GID index. 

Destination's GID (a.k.a. IPv6) address. 

Alternate path address information, applicable only for RC & UC
QPs when this CI supports automatic path migration. Note: the
path MTU for the alternate path must be the same as for the
primary path. The specifics are:

• Alternate path P_Key index. 

• Alternate path Physical port. 

• Alternate path address vector, containing: 
Service level. 

Send Global Routing Header Flag. 

Destination LID. If the destination is in the same subnet,
LID = final destination; otherwise LID = router LID.

Maximum Static Rate. 

Timeout. Applicable only to RC QPs. 

Retry count. Applicable only to RC QPs. 

RNR retry count. Applicable only to RC QPs. 

Source Path Bits. 
For global destination:

• Traffic class. 
Flow label. 

• Hop limit. 

• Source GID index. 

• Destination's GID (a.k.a. IPv6 address). 

Path Migration state. Valid only if this HCA supports automatic 
path migration. Valid states to set are: 

Migrated. 

• Rearm. 

Output Modifiers: 

• The actual number of outstanding Work Requests supported on
the Send Queue, if resizing of the work queues is supported by
the HCA. If an error is not returned, this is guaranteed to be
greater than or equal to the number requested. (This may require
the Consumer to increase the size of the CQ.)

• The actual number of outstanding Work Requests supported on 
the Receive Queue, if resizing of the work queues is supported by 
the HCA. If an error is not returned, this is guaranteed to be great- 
er than or equal to the number requested. (This may require the 
Consumer to increase the size of the CQ.) 

• Verb Results: 
Operation completed successfully. 
Insufficient resources to complete request. 
Invalid HCA handle. 
Invalid QP handle. 
Cannot change QP attribute. 
Atomic operations not supported. 
P_Key index out of range. 
P_Key index specifies invalid entry in P_Key table.
Invalid QP state. 
Invalid path migration state. 
MTU of HCA port exceeded. 
Invalid Port. 

Invalid Service Type for this QP.

Maximum number of Work Requests requested exceeds HCA
capability.

Invalid RNR NAK Timer Field value.
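The Destination LID rule in the address vectors above (final destination LID within the same subnet, router LID otherwise) can be illustrated with a minimal Python sketch. The `Endpoint` type, the `subnet_prefix` field and the function name are hypothetical and are not defined by this specification; they exist only to make the rule concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical description of a destination port (not a spec-defined type)."""
    lid: int            # LID of the final destination
    subnet_prefix: int  # hypothetical identifier of the destination's subnet


def select_destination_lid(local_subnet_prefix: int, dest: Endpoint, router_lid: int) -> int:
    """Return the LID to place in the address vector.

    Same subnet: the LID of the final destination.
    Different subnet: the LID of the router.
    """
    if dest.subnet_prefix == local_subnet_prefix:
        return dest.lid
    return router_lid
```

For a global destination the address vector additionally carries the traffic class, flow label, hop limit, source GID index and destination GID listed above; the sketch covers only the LID choice.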
11.2.3.3 Query Queue Pair

Description:

Returns the attribute list and current values for the specified QP. This
QP handle can be any QP handle supplied by the Verbs.

Input Modifiers:

• HCA handle.

• QP handle.

Output Modifiers:

• The QP attributes. The list of attributes returned by the query are:

• The QP number.

• The actual number of outstanding requests supported on the
Send Queue.

• The actual number of outstanding requests supported on the
Receive Queue.

• The actual number of scatter/gather entries supported on
Work Requests submitted to the Send Queue.

• The actual number of scatter/gather entries supported on
Work Requests submitted to the Receive Queue.

• Current QP state.

The following attributes are not defined if the QP is in the Reset state.

• PSNs for Send & Receive QPs. Applicable only for RC & UC
QPs.

• RDMA Read enable.

• RDMA Write enable.

• Atomic Operation enable.

• Primary physical port associated with this QP. Not applicable
on RD QPs.

• Q_Key for the Receive Queue. Not applicable to RC, UC &
Raw Datagram QPs.

• Primary P_Key index. Not applicable for RD & Raw Datagram
QPs.

• Reliable Datagram Domain. Applicable only to RD QPs.

Destination QP number. Applicable only to RC & UC QPs. 

Number of RDMA Reads & Atomic Operations outstanding at 
any time on the destination QP. Applicable only to RC QPs. 

Number of responder resources for handling incoming RDMA 
Reads & atomic operations. Applicable only to RC QPs. 

Minimum RNR NAK Timer Field Value. When a message arrives
which is targeted at a local receive queue, and that receive
queue has no receive work requests outstanding, the CI
may respond to the initiator with an RNR NAK packet. This
modifier is the minimum value which shall be sent in the Timer
Field of such an RNR NAK packet; it does not affect RNR
NAKs sent for other reasons. Applicable only to RC QPs.

Primary Address vector, for RC & UC transports only, contain- 
ing: 

Service level. 

Send Global Routing Header Flag. 

Destination LID. If destination is in same subnet, LID =
final destination; otherwise LID = router LID.

Path MTU. 

Maximum Static Rate. 

Timeout. Applicable only to RC QPs. 

Retry count. Applicable only to RC QPs. 

RNR retry count. Applicable only to RC QPs. 

Source Path Bits. 

For global destination: 

Traffic class. 

Flow label. 

Hop limit. 

• Source GID index. 

Destination's GID (a.k.a. IPv6) address. 

Alternate path address information, returned only for RC & 
UC QPs. Valid only when automatic path migration is en- 
abled. 

Alternate path P_Key index. 
Alternate path Physical port. 
Path address vector, containing: 
Service level. 
Send Global Routing Header Flag.

Destination LID. If the destination is in the same subnet,
LID = final destination; otherwise LID = router LID.

Maximum Static Rate.

Timeout. Applicable only to RC QPs.

Retry count. Applicable only to RC QPs.

RNR retry count. Applicable only to RC QPs.

Source Path Bits.

For global destination:

Traffic class.

Flow label.

Hop limit.

Source GID index.

Destination's GID (a.k.a. IPv6 address).

Path migration state. Valid only if this HCA supports automatic
path migration.

• Verb Results:

• Operation completed successfully.

• Invalid HCA handle.

• Invalid QP handle.

11.2.3.4 Destroy Queue Pair

Description:

Destroys the specified QP.

C11-11: Any resources allocated by the Channel Interface in order to
process Work Requests on the QP must be deallocated as part of the
destroy operation.

A QP instance is allowed to have Work Requests outstanding when a 
request to destroy the QP is made. When a QP is destroyed, any out- 
standing Work Requests are no longer considered to be in the scope 
of the Channel Interface. It is the responsibility of the Consumer to 
clean up resources associated with a Work Request. 

C11-12: Outstanding Work Requests on this QP shall not be completed
after this Verb returns. Incoming operations destined for a QP that has
been destroyed are discarded.

Input Modifiers:

• HCA handle.

• QP handle.

Output Modifiers:

• Verb Results:

• Operation completed successfully.

• Invalid HCA handle.

• Invalid QP handle.

11.2.4 Get Special QP

Description:

Returns the handle for the specified QP type for the specified HCA
port. The special QP types include: SMI QP (QP0), GSI QP (QP1),
Raw IPv6 and Raw Ethertype.

C11-13: This Verb must support QP0 and QP1.

HCA support for both Raw Datagram types is optional. 

o11-1: If the HCA supports the Raw Datagram QP types, this Verb must
also support them.

C11-14: Handles associated with the SMI QP and the GSI QP must only
be given out once for each QP per HCA port. Subsequent invocations of
this Verb, without an intervening Destroy QP, must return an error.

The single QP per port restriction does not apply to either Raw
Datagram QP type.

o11-2: If Raw Datagram Service is supported, the number of Raw
Datagram type QPs supported per port shall be returned by the Query HCA
Verb.

Any fixed QP attributes for the specified QP type required by the spe- 
cific implementation are set up before returning from this Verb. For ex- 
ample, the appropriate Transport Service Type may need to be 
initialized for the QP. 
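The once-per-port rule for SMI and GSI handles stated above can be modeled with a short Python sketch. The class name, the `VerbError` exception and the tuple used as a handle are hypothetical; a real Channel Interface defines its own handle and error representations, and the per-port limit on Raw Datagram QPs is deliberately not modeled here.

```python
class VerbError(Exception):
    """Hypothetical stand-in for an error Verb Result."""


# QP types whose handle may be given out only once per HCA port.
SINGLE_HANDLE_TYPES = {"SMI", "GSI"}


class SpecialQPTable:
    """Tracks which special QP handles are currently issued (illustrative only)."""

    def __init__(self):
        self._issued = set()  # set of (port, qp_type) pairs currently handed out

    def get_special_qp(self, port, qp_type):
        key = (port, qp_type)
        if qp_type in SINGLE_HANDLE_TYPES and key in self._issued:
            # A second request without an intervening Destroy QP must fail.
            raise VerbError("QP already in use")
        self._issued.add(key)
        return key  # the (port, qp_type) pair doubles as the handle in this sketch

    def destroy_qp(self, handle):
        # After destruction the handle may be given out again.
        self._issued.discard(handle)
```

A second `get_special_qp(1, "SMI")` raises until `destroy_qp` is called for the first handle, while the same QP type on a different port is unaffected.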

C11-15: SMI/GSI QPs shall not share a completion queue with any
non-SMI/GSI QP. An attempt to do so shall result in an Invalid CQ Handle
error.

Input Modifiers: 

• HCA Handle. 

• HCA port number. 

• The QP type requested. The allowed types are: 
• SMI QP (QP0).






InfiniBand™ Trade Association

Page 470

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067

InfiniBand™ Architecture Release 1.0
Volume 1 - General Specifications

Software Transport Verbs

October 24, 2000
FINAL



• GSI QP (QP1).

• Raw IPv6.

• Raw Ethertype.

• The CQ to be associated with the Send Queue.

• The CQ to be associated with the Receive Queue.

• The maximum number of outstanding Work Requests the Consumer
expects to submit to the Send Queue.

• The maximum number of outstanding Work Requests the Consumer
expects to submit to the Receive Queue.

• The maximum number of scatter/gather elements the Consumer
expects to specify in a Work Request submitted to the Send
Queue.

• The maximum number of scatter/gather elements the Consumer
expects to specify in a Work Request submitted to the Receive
Queue.

• The Signaling Type for the Send Queue on this QP. The valid
types are:

• All Work Requests submitted to the Send Queue always generate
a completion entry.

• Consumer must specify on each Work Request submitted to
the Send Queue whether to generate a completion entry for
successful completions.

• Protection Domain.

Output Modifiers:

• QP handle.

• The actual number of outstanding Work Requests supported on
the Send Queue. If an error is not returned, this is guaranteed to
be greater than or equal to the number requested. (This may require
the Consumer to increase the size of the CQ.)

• The actual number of outstanding Work Requests supported
through the Verbs on the Receive Queue. If an error is not returned,
this is guaranteed to be greater than or equal to the number
requested. (This may require the Consumer to increase the
size of the CQ.)

• The actual number of scatter/gather elements that can be specified
in Work Requests submitted to the Send Queue. If an error is
not returned, this is guaranteed to be greater than or equal to the
number requested.
• The actual number of scatter/gather elements that can be specified
in Work Requests submitted to the Receive Queue. If an error
is not returned, this is guaranteed to be greater than or equal
to the number requested.

• Verb Results:

• Operation completed successfully.

• Insufficient resources to complete request.

• Invalid HCA handle.

• Invalid Special QP type.

• QP already in use (applies only to SMI and GSI QPs).

• Number of available Raw Datagram QPs exceeded.

• Invalid Port.

• Invalid CQ handle.

• Maximum number of Work Requests requested exceeds HCA
capability.

• Maximum number of scatter/gather elements requested exceeds
HCA capability.

• Invalid Protection Domain.

• Raw Datagrams not supported.

11.2.5 Completion Queue

11.2.5.1 Create Completion Queue

Description:

Creates a CQ on the specified HCA.

The Consumer must specify the minimum number of entries in the
CQ.

The actual number of completion entries on the specified CQ is returned
on successful creation. The number returned differs only when
the number of actual entries is more than the Consumer requested. If
the number of entries the HCA supports is less than the Consumer
requested, an error is returned.

On success, a handle to the newly created Completion Queue is
returned.

Input Modifiers:

• HCA handle.

• The minimum number of entries in the CQ.

Output Modifiers:

• The handle of the newly created CQ.

• The actual number of entries in the CQ.

• Verb Results:

• Operation completed successfully.

• Insufficient resources to complete request.

• Invalid HCA handle.

• Number of CQ entries requested exceeds HCA capability.

11.2.5.2 Query Completion Queue

Description:

Returns the number of entries in the specified CQ.

Input Modifiers:

• HCA handle.

• CQ handle.

Output Modifiers:

• The total number of entries in the CQ.

• Verb Results:

• Operation completed successfully.

• Invalid HCA handle.

• Invalid CQ handle.

11.2.5.3 Resize Completion Queue

Description:

Resizes the CQ.

C11-16: The CI must support resizing a CQ with outstanding Work
Completions and while Work Requests are outstanding on queues associated
with the specified CQ. Completions must not be lost as a result of a
resize.

The resize operation is allowed to adversely affect performance
while the CQ is being resized. The act of resizing is not allowed to
directly generate completion or asynchronous errors.

Input Modifiers:

• HCA handle.

• CQ handle.

• The minimum number of entries in the CQ.






Output Modifiers:

• The actual number of entries in the CQ.

• Verb Results:

• Operation completed successfully.

• Insufficient resources to complete request.

• Invalid HCA handle.

• Invalid CQ handle.

• Number of CQ entries requested exceeds HCA capability.

• More outstanding entries on CQ than size specified.
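The sizing rules for creating and resizing a CQ (the actual size is never less than the requested minimum, a request beyond the HCA capability is an error, and a resize below the number of outstanding entries is an error) can be sketched in Python as follows. The class, the `VerbError` exception, the `pending` counter and the rounding granularity are assumptions made for illustration; the specification does not prescribe how an implementation chooses the actual size.

```python
class VerbError(Exception):
    """Hypothetical stand-in for an error Verb Result."""


class CompletionQueue:
    """Illustrative model of CQ sizing; not a real Channel Interface."""

    def __init__(self, requested, hca_max_entries, granularity=64):
        self._hca_max = hca_max_entries
        self._granularity = granularity  # assumed allocation step of the HCA
        self.pending = 0                 # completions not yet retrieved
        self.entries = self._actual_size(requested)

    def _actual_size(self, requested):
        if requested > self._hca_max:
            raise VerbError("Number of CQ entries requested exceeds HCA capability")
        # Round up to the assumed granularity, but never past the HCA maximum.
        rounded = -(-requested // self._granularity) * self._granularity
        return min(rounded, self._hca_max)

    def resize(self, requested):
        if self.pending > requested:
            raise VerbError("More outstanding entries on CQ than size specified")
        self.entries = self._actual_size(requested)
        return self.entries
```

Because `_actual_size` only rounds up and caps at a maximum that is itself at least the requested value, the returned size is always greater than or equal to the request, matching the guarantee described for the output modifier.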

11.2.5.4 Destroy Completion Queue

Description:

Destroys the specified CQ. Resources allocated by the Channel Interface
to implement the CQ must be deallocated during the destroy
operation.

C11-17: The CI shall return an error if this Verb is invoked while a Work
Queue is still associated with the CQ.

Any completions that have not been retrieved from the CQ prior to
being destroyed are discarded.

Input Modifiers:

• HCA handle.

• CQ handle.

Output Modifiers:

• Verb Results:

• Operation completed successfully.

• Invalid HCA handle.

• Invalid CQ handle.

• One or more Work Queues is still associated with the CQ.
11.2.6 EE Context

11.2.6.1 Create EE Context

Description:

Creates an EE Context for the specified HCA.

On success, a handle to the newly created EE Context is returned.

The values for Remote Node Address Handle, Send Sequence
Number, Receive Sequence number are all zero. The EE Context is
created in the Reset state.

Input Modifiers:

• HCA handle.

• Reliable Datagram Domain.

Output Modifiers:

• The handle for the newly created EE Context.

• Verb Results:

• Operation completed successfully.

• Insufficient resources to complete request.

• Invalid HCA handle.

• Reliable Datagrams not supported.

• Invalid Reliable Datagram Domain.
11.2.6.2 Modify EE Context Attributes

Description:

o11-3: If the CI supports RD Service, upon invocation of this Verb the CI
shall modify the attributes for the specified EE Context and then shall
cause the EE Context to transition to the specified EE Context state.

Only a subset of the attributes can be modified once the EE Context
has been created.

EE Context attributes can be modified with Work Requests outstanding
which use the EE handle, but any such Work Requests might
not execute correctly if the modifiers are changed.

WRs can be outstanding on the Receive Queue when the EE is in the
Init, RTR, and RTS state and on the Send Queue when the EE is in
the RTS state. Any outstanding Work Requests on the QP may not
execute as expected if the EE modifiers are changed.

If any invalid attribute or value is specified, an error is returned and all
EE Context attributes remain unchanged.
o11-4: If the CI supports RD Service, the properties and requirements of
the EE Context state transitions shall be supported as shown in Table 82.

Table 82: EE Context State Transition Properties

| Transition | Required Attributes | Optional Attributes | Actions |
|---|---|---|---|
| Reset to Init | Physical port. P_Key index. | None. | Enable posting to the receive queue. |
| Init to RTR | Remote Node Address. RQ PSN. Number of responder resources for RDMA Read/atomic ops. (Note this is 1 for this revision.) | Alternate destination None. | Activate receive processing. |
| RTR to RTS | Timeout. Retry count. RNR retry count. SQ PSN. Number of Outstanding RDMA Read/atomic ops at destination. | Alternate destination node address. Path migration state. | Activate send processing. |
| RTS to RTS (no transition) | None. | Alternate destination node address. Path migration state. | No transition. |
| RTS to SQD | None. | None. | Deactivate send processing. |
| SQD to RTS | None. | Remote Node Address Vector. Alternate destination node address. Channel migration state. Timeout/Retry values. Number of Outstanding RDMA Read/atomic ops at destination. Number of local RDMA Read/atomic responder resources. Q_Key. | Activate send processing. |
| Any state to Error | None. | None allowed. | Queue processing is stopped. Work Requests in process are completed in error, when possible. |
| Any state to Reset | None. | None allowed. | EE attributes are reset to the same values after the EE was created. Outstanding Work Requests are removed from the queues without notifying the Consumer. |
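The transitions and required attributes of Table 82 can be checked mechanically. The following Python sketch encodes only the Transition and Required Attributes columns; the attribute key names (`physical_port`, `rq_psn` and so on) and the use of `ValueError` are hypothetical choices for illustration and do not correspond to any identifiers defined by the specification. Optional attributes are not validated.

```python
# Transitions listed explicitly in Table 82 (source state, target state).
ALLOWED = {
    ("Reset", "Init"), ("Init", "RTR"), ("RTR", "RTS"),
    ("RTS", "RTS"), ("RTS", "SQD"), ("SQD", "RTS"),
}

# Required attributes per transition; key names are illustrative only.
REQUIRED = {
    ("Reset", "Init"): {"physical_port", "p_key_index"},
    ("Init", "RTR"): {"remote_node_address", "rq_psn", "responder_resources"},
    ("RTR", "RTS"): {"timeout", "retry_count", "rnr_retry_count", "sq_psn",
                     "outstanding_rdma_atomic"},
}


def check_transition(current, target, attrs):
    """Raise ValueError if the requested EE Context transition is not permitted."""
    if target in ("Error", "Reset"):
        # Any state may move to Error or Reset, but no attributes are allowed.
        if attrs:
            raise ValueError("Cannot change EE Context attribute")
        return
    if (current, target) not in ALLOWED:
        raise ValueError("Invalid EE Context state")
    missing = REQUIRED.get((current, target), set()) - set(attrs)
    if missing:
        raise ValueError("missing required attributes: " + ", ".join(sorted(missing)))
```

A caller would run this check before applying any attribute, which keeps the rule that all EE Context attributes remain unchanged when an invalid request is made.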



Input Modifiers: 



HCA handle. 

EE Context handle. 

The EE Context attributes to modify and their new values. The EE 
Context attributes that can be modified after the EE Context has 
been created are: 

• Primary path Physical Port. 
Primary path P_Key Index. 

• PSNs for Sends & Receives. 

• EE Context State.

• Enable or disable Send Queue Drained Asynchronous Affiliated
Event Notification. This modifier is only applicable when the
next EE state chosen is SQD.

Number of RDMA Reads & Atomic Operations outstanding at 
any time on the destination EE. 

Number of responder resources for handling incoming RDMA 
Reads & atomic operations. This value may be rounded up to 
a supported value, not to exceed the maximum value allow- 
able for EEs for this HCA. Note for this version of the specifi- 
cation, this value is one. 

Destination EE Context number 

• Primary Address vector, containing: 
Service level. 

Send Global Routing Header Flag. 

Destination LID. If destination is in same subnet, LID = 
final destination; otherwise LID = router LID. 

Path MTU. 

Maximum Static Rate. 
Timeout. 
• Retry count.

• RNR retry count.

• Source Path Bits.

• For global destination:

• Traffic class.

• Flow label.

• Hop limit.

• Source GID index.

• Destination's GID (a.k.a. IPv6) address.

• Alternate path address information. Valid only when automatic
path migration is enabled.

• Alternate path P_Key index.

• Alternate path Physical port.

• Alternate path address vector, containing:

• Service level.

• Send Global Routing Header Flag.

• Destination LID. If the destination is in the same subnet,
LID = final destination; otherwise LID = router LID.

• Maximum Static Rate.

• Timeout.

• Retry count.

• RNR retry count.

• Source Path Bits.

• For global destination:

• Traffic class.

• Flow label.

• Hop limit.

• Source GID index.

• Destination's GID (a.k.a. IPv6 address).

• Path migration state. Valid only if this HCA supports automatic
path migration. Valid states to set are:

• Migrated.

• Rearm.

Output Modifiers:

• Verb Results:
• Operation completed successfully.

• Insufficient resources to complete request.

• Invalid HCA handle.

• Invalid EE Context handle.

• Cannot change EE Context attribute.

• Invalid EE Context state.

• Invalid path migration state.

• Reliable Datagrams not supported.

11.2.6.3 Query EE Context

Description:

Returns the attribute list and current values for the specified EE
Context.

Input Modifiers:

• HCA handle.

• EE Context handle.

Output Modifiers:

• The EE Context attributes. The list of attributes returned by the
query are:

• EE Context State.

• Send Queue Draining. This modifier is only applicable
when the EE is in the SQD state.

• Send Queue Drained. This modifier is only applicable
when the EE is in the SQD state.

• EE Context Number.

The following attributes are not defined if the EE is in the Reset
state.

• Primary path Physical Port.

• Primary path P_Key Index.

• PSNs for Sends & Receives.

• Reliable Datagram Domain.

• Number of RDMA Reads & Atomic Operations outstanding at
any time on the destination EE.

• Number of responder resources for handling incoming RDMA
Reads & atomic operations.

• Destination EE Context number.
• Primary Address vector, containing:

• Service level.

• Send Global Routing Header Flag.

• Destination LID. If destination is in same subnet, LID =
final destination; otherwise LID = router LID.

• Path MTU.

• Maximum Static Rate.

• Timeout.

• Retry count.

• RNR retry count.

• Source Path Bits.

• For global destination:

• Traffic class.

• Flow label.

• Hop limit.

• Source GID index.

• Destination's GID (a.k.a. IPv6) address.

• Alternate path address information. Valid only when automatic
path migration is enabled.

• Alternate path P_Key index.

• Alternate path Physical port.

• Alternate path address vector, containing:

• Service level.

• Send Global Routing Header Flag.

• Destination LID. If the destination is in the same subnet,
LID = final destination; otherwise LID = router LID.

• Maximum Static Rate.

• Timeout.

• Retry count.

• RNR retry count.

• Source Path Bits.

• For global destination:

• Traffic class.

• Flow label.

• Hop limit.
• Source GID index.

• Destination's GID (a.k.a. IPv6 address).

• Path migration state. Applicable only if this HCA supports
automatic path migration.

• Verb Results:

• Operation completed successfully.

• Invalid HCA handle.

• Invalid EE Context handle.

• Reliable Datagrams not supported.

11.2.6.4 Destroy EE Context

Description:

Destroys the specified EE Context. Any resources allocated by the
Channel Interface for use by the EE Context are freed from use.

o11-5: If the CI supports RD Service, after this Verb is invoked, any
outstanding or subsequently submitted Work Requests which depend on the
EE Context shall complete with an Invalid EE Context Number error.

Input Modifiers:

• HCA handle.

• EE Context handle.

Output Modifiers:

• Verb Results:

• Operation completed successfully.

• Invalid HCA handle.

• Invalid EE Context handle.

• Reliable Datagrams not supported.

11.2.7 Memory Management

Memory Management Verbs are partitioned into two categories:

1) Registration of memory regions.

The Verbs used to register memory regions are the following:

• Register Memory Region.

• Register Physical Memory Region.

• Query Memory Region.

• Deregister Memory Region.







• Reregister Memory Region.

• Reregister Physical Memory Region.

• Register Shared Memory Region.

2) Binding of Memory Windows.

The Verbs used to allocate and bind Memory Windows are the following:

• Allocate Memory Window.

• Query Memory Window.

• Bind Memory Window.

• Deallocate Memory Window.
11.2.7.1 Register Memory Region

Description:

Prepares a virtually addressed memory region for use by an HCA. A
description of the registered memory suitable for use in Work Requests
to describe locally accessible memory locations is returned.
When specifically requested, a description of the registered memory
suitable for use by inbound RDMA and/or atomic operations is
returned.

This Verb depends on OSV supplied functions to perform the pinning
of memory pages and creating the virtual to physical translations that
represent the memory region.

Input Modifiers:

• HCA Handle.

• Virtual Address - the address of the first byte of the region to be
registered. The maximum size of a Virtual Address is 64 bits.

• Length of region to be registered in bytes.

• Protection Domain to be assigned to the registered region.

• Access Control - The following may be selected in any combination
except as noted.

• Enable Local Write Access.

• Enable Remote Write Access.

Remote Write Access requires Local Write Access to be enabled.

• Enable Remote Read Access.

• Enable Remote Atomic Operation Access (if Atomic Ops supported).

Remote Atomic Operation Access requires Local Write Access.
• Enable Memory Window Binding.

Note: Local Read Access is always implied.

Output Modifiers:

• Memory Region Handle - used to identify this specific registered
region to the Memory Management Verbs.

• L_Key - used for local access.

• R_Key - used for remote access.

The R_Key is returned only when Remote Access was requested.

• Verb Results:

• Operation completed successfully.

• Insufficient resources to complete request.

• Invalid HCA handle.

• Invalid Virtual Address.

• Invalid Length.

• Invalid Protection Domain.

• Invalid Access Control specifier.
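The Access Control combinations described for Register Memory Region can be expressed as a small Python sketch using a flag enumeration. The enumeration, its member names and the function name are hypothetical. Treating only the three remote flags as "Remote Access" for the purpose of returning an R_Key is an assumption of this sketch, as is omitting the check that Atomic Operations are supported by the HCA.

```python
from enum import Flag, auto


class Access(Flag):
    """Illustrative access-control bits; Local Read is always implied."""
    LOCAL_WRITE = auto()
    REMOTE_WRITE = auto()
    REMOTE_READ = auto()
    REMOTE_ATOMIC = auto()
    MW_BIND = auto()


REMOTE = Access.REMOTE_WRITE | Access.REMOTE_READ | Access.REMOTE_ATOMIC
NEEDS_LOCAL_WRITE = Access.REMOTE_WRITE | Access.REMOTE_ATOMIC


def validate_access(access: Access) -> bool:
    """Validate the combination and report whether an R_Key would be returned."""
    # Remote Write and Remote Atomic both require Local Write to be enabled.
    if access & NEEDS_LOCAL_WRITE and not access & Access.LOCAL_WRITE:
        raise ValueError("Invalid Access Control specifier")
    # The R_Key is returned only when some form of remote access was requested.
    return bool(access & REMOTE)
```

For example, requesting Remote Read alone is valid and yields an R_Key, while requesting Remote Write without Local Write is rejected.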

11.2.7.2 Register Physical Memory Region

Description:

Prepares a physically addressed memory region for use by an HCA.
A description of the registered memory suitable for use in Work Requests
to describe locally accessible memory locations is returned.
When specifically requested, a description of the registered memory
suitable for use by inbound RDMA and/or atomic operations is
returned.

In addition to a list of physical buffers, the Consumer supplies a
requested "I/O Virtual Address" to be associated with the first byte of the
Region. The Consumer also supplies the length of the entire Region
plus a byte offset that specifies where the Region begins within the
first physical buffer. The Channel Interface returns the I/O Virtual
Address that is actually assigned for the Region.

Input Modifiers:

• HCA Handle.

• List of Physical Buffers - Each buffer must begin and end on
an HCA-supported page boundary.

• Total number of Physical Buffers in the list.
• I/O Virtual Address - IOVA requested by the Consumer for the
first byte of the region.

• Length of Region to be registered in bytes.

• Offset of Region's starting IOVA within the first physical buffer.

• Protection Domain to be assigned to the registered region.

• Access Control - The following may be selected in any combination
except as noted.

• Enable Local Write Access.

• Enable Remote Write Access.

Remote Write Access requires Local Write Access to be enabled.

• Enable Remote Read Access.

• Enable Remote Atomic Operation Access (if Atomic Ops supported).

Remote Atomic Operation Access requires Local Write Access.

• Enable Memory Window Binding.

Note: Local Read Access is always implied.

Output Modifiers:

• Memory Region Handle - used to identify this specific registered
region to the Memory Management Verbs.

• I/O Virtual Address - IOVA actually assigned by the Channel
Interface for the first byte of the Region.

• L_Key - used for local access.

• R_Key - used for remote access.

The R_Key is returned only when Remote Access was requested.

• Verb Results:

• Operation completed successfully.

• Insufficient resources to complete request.

• Invalid HCA handle.

• Invalid Physical Buffer List entry.

• Invalid Length.

• Invalid Offset.

• Invalid Protection Domain.

• Invalid Access Control specifier.
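The input checks implied by Register Physical Memory Region (page-aligned buffers, an offset inside the first buffer, and a length in bytes) can be sketched in Python. Representing each buffer as a `(start_address, size)` pair, requiring the Region to fit inside the supplied buffers, and the mapping of each failure to a particular Verb Result are assumptions of this sketch; the specification text above states only the page-boundary requirement explicitly.

```python
def validate_physical_region(buffers, page_size, offset, length):
    """Check a physical buffer list; return the total bytes the buffers cover.

    buffers: list of (start_address, size_in_bytes) tuples (hypothetical layout).
    """
    if not buffers:
        raise ValueError("Invalid Physical Buffer List entry")
    for start, size in buffers:
        # Each buffer must begin and end on an HCA-supported page boundary.
        if size <= 0 or start % page_size or (start + size) % page_size:
            raise ValueError("Invalid Physical Buffer List entry")
    # The Region begins at a byte offset within the first physical buffer.
    if not 0 <= offset < buffers[0][1]:
        raise ValueError("Invalid Offset")
    total = sum(size for _, size in buffers)
    # Assumed rule: the Region must not extend past the supplied buffers.
    if length <= 0 or offset + length > total:
        raise ValueError("Invalid Length")
    return total
```

With two buffers of 0x2000 and 0x1000 bytes, an offset of 0x10 and a length of 0x2FF0 exactly fill the 0x3000 bytes available.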

11.2.7.3 Query Memory Region

Description:

Retrieves information about a specific memory region.

Input Modifiers:

• HCA Handle.

• Memory Region Handle - as issued when region was registered.

Output Modifiers:

• L_Key - as issued when region was registered.

• R_Key - as issued when region was registered.

• Actual Local Protection Bounds enforced by the Channel
Interface.

• Actual Remote Protection Bounds enforced by the Channel
Interface.

The Remote Protection Bounds are returned only when Local and
Remote Access to the region was requested.

• Protection Domain assigned to the registered region.

• Access Control settings for the registered region.

• Verb Results:

• Operation completed successfully.

• Invalid HCA handle.

• Invalid Memory Region handle.
11.2.7.4 Deregister Memory Region

Description:

Removes a memory region from the HCA translation table. The region
is unpinned if pinned in the associated registration Verb. This Verb is
responsible only for deallocating resources allocated as part of the
associated registration operation. All other resources are the
responsibility of the Consumer.

It is an error for a Consumer to attempt to deregister a Memory Region
while it still has any Memory Windows bound to it. Channel Interface
implementations have options on how to deal with the error, described
in 10.6.6.2.4 Deregistering Regions with Bound Windows on page 413.

Work Requests or Remote Operation requests that are in process and
actively referencing memory locations in a Memory Region that is
deregistered must fail with a protection violation. Work Requests or
Remote Operation requests that attempt to access memory locations
in a Memory Region that has been deregistered must fail with a
protection violation.

This Verb depends on the availability of OSV supplied functions to
perform the unpinning of memory pages.

Input Modifiers:

• HCA Handle.
• Memory Region Handle - as issued when region was registered.

Output Modifiers:

• No output modifiers.
• Verb Results:
    • Operation completed successfully.
    • Invalid HCA handle.
    • Invalid Memory Region handle.
    • Operation denied; Region still has bound Window(s).
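
The bound-window rule for deregistration can be illustrated with a small in-memory model. This is a minimal sketch and not part of the specification: `ToyHCA`, `Region`, and every method name are hypothetical, and the sketch assumes an implementation that chooses to deny the request, which is only one of the options the specification allows for this error.

```python
from dataclasses import dataclass, field

OK = "Operation completed successfully"
INVALID_MR = "Invalid Memory Region handle"
DENIED = "Operation denied; Region still has bound Window(s)"


@dataclass
class Region:
    pinned: bool = True
    bound_windows: set = field(default_factory=set)


class ToyHCA:
    """Hypothetical stand-in for an HCA translation table."""

    def __init__(self):
        self.regions = {}
        self._next_handle = 1

    def register(self):
        handle = self._next_handle
        self._next_handle += 1
        self.regions[handle] = Region()
        return handle

    def bind_window(self, mr_handle, window_id):
        self.regions[mr_handle].bound_windows.add(window_id)

    def unbind_window(self, mr_handle, window_id):
        self.regions[mr_handle].bound_windows.discard(window_id)

    def deregister(self, mr_handle):
        region = self.regions.get(mr_handle)
        if region is None:
            return INVALID_MR
        if region.bound_windows:
            # Assumed policy: refuse and leave the registration untouched.
            return DENIED
        region.pinned = False  # unpin pages that were pinned at registration
        del self.regions[mr_handle]  # remove the translation table entry
        return OK
```

A Consumer using this model would unbind every window first and only then deregister the region.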
11.2.7.5 Reregister Memory Region

Description:

Modifies the attributes of an existing Memory Region. Any existing
Region owned by the Consumer can be modified, regardless of which
Verb created it initially¹, or which Verb (if any) reregistered it most
recently². A description of the Memory Region suitable for use in Work
Requests to describe locally accessible memory locations is returned.
When specifically requested, a description of the Memory Region
suitable for use by inbound RDMA and/or atomic operations is returned.

This Verb conceptually performs the functions Deregister Memory
Region followed by Register Memory Region. Where possible, resources
below the Verb layer are expected to be reused instead of deallocated
and reallocated. This Verb may be used to change the access rights
and/or protection domain of a region, as well as changing the memory
locations that are registered.

The L_Key and R_Key output modifiers from this Verb must be used
in place of any previously issued for this region.

This Verb depends on the availability of OSV supplied functions to
perform the pinning and unpinning of memory pages and creating the
virtual to physical translations that represent the memory region.

1. For instance, a Region created with Register Physical Memory Region can
later be modified by Reregister Memory Region.

2. For instance, a Region modified by Reregister Physical Memory Region can
later be modified by Reregister Memory Region.

It is an error for a Consumer to attempt to reregister a Memory Region
while the Region still has any Memory Windows bound to it. Channel
Interface implementations have options on how to deal with the error,
as described in 10.6.6.2.4 Deregistering Regions with Bound Windows on
page 413.

C11-18: If the CI returns the "Operation denied" (due to a bound Window)
error, the CI shall make no change to the current registration.

C11-19: If the CI returns either the "Invalid HCA handle" or "Invalid
Memory Region handle" error, the CI shall make no change to the current
registration (assuming that it even exists).

C11-20: If the CI returns any other error, the CI shall invalidate both "old"
and "new" registrations, and release any associated resources.

C11-21: For the error case where a remote agent is accessing a Memory
Region while it is in the process of being reregistered, the CI must present
the same semantics as a deregistration operation followed by a separate
registration operation.

Input Modifiers:

• HCA Handle.
• Memory Region Handle - as issued when region was registered.
• Change Request type - The following may be selected in any
  combination; the input modifiers required to support the request
  are listed below each request.
    • Change Translation.
      Input Modifiers required.
        • Virtual Address.
        • Length.
    • Change Protection Domain.
      Input Modifiers required.
        • Protection Domain.
    • Change Access Control.
      Input Modifiers required.
        • Access Control Selections.

Output Modifiers:

• Memory Region Handle - must be used for future references to
  this Memory Region. Might or might not be the same as the previous
  Region Handle.
• L_Key - used for local access.
• R_Key - used for remote access.

  The R_Key is returned only when Remote Access was requested.

• Verb Results:
    • Operation completed successfully.
    • Insufficient resources to complete request.
    • Invalid HCA handle.
    • Invalid Memory Region handle.
    • Invalid Virtual Address.
    • Invalid Length.
    • Invalid Protection Domain.
    • Invalid Access Control specifier.
    • Operation denied; Region still has bound Window(s).

Usage Example:

a) To modify only the Access Control of an already registered region,
   the Memory Region Handle, a Change Access Control Request and the
   new Access Control Selections input modifiers would be supplied to
   the Verb.

b) To change the address translations of a region, the Memory Region
   Handle, a Change Translation Request and the new Virtual Address
   and length input modifiers would be supplied to the Verb. The pages
   previously pinned would be unpinned, the new memory region would
   be pinned and registered (and if requested bound) using the region's
   access controls and protection domain. Previous translations would
   be removed or replaced as needed.

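
The rules above for what happens to the current registration when Reregister Memory Region fails can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the function name and the use of opaque values for the "old" and "new" registrations are hypothetical, and `None` is used here to mean that no valid registration remains.

```python
OK = "Operation completed successfully"
DENIED = "Operation denied; Region still has bound Window(s)"
INVALID_HCA = "Invalid HCA handle"
INVALID_MR = "Invalid Memory Region handle"

# Results after which the current registration must be left unchanged.
NO_CHANGE_RESULTS = {DENIED, INVALID_HCA, INVALID_MR}


def registration_after_reregister(current, requested, result):
    """Return the registration that exists once the Verb has returned.

    `current` and `requested` are opaque descriptions of the old and new
    registrations; None means that no valid registration remains.
    """
    if result == OK:
        return requested
    if result in NO_CHANGE_RESULTS:
        return current
    # Any other error: both "old" and "new" registrations are invalidated.
    return None
```

The practical consequence for a Consumer is that a failure such as "Invalid Length" leaves the region unusable, so the old L_Key and R_Key must not be relied on afterwards.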

11.2.7.6 Reregister Physical Memory Region

Description:

Modifies the attributes of an existing Memory Region. Any existing
Region owned by the Consumer can be modified, regardless of which
Verb created it initially¹, or which Verb (if any) reregistered it most
recently². A description of the Memory Region suitable for use in Work
Requests to describe locally accessible memory locations is returned.
When specifically requested, a description of the Memory Region
suitable for use by inbound RDMA and/or atomic operations is returned.

1. For instance, a Region created with Register Memory Region can later be
modified by Reregister Physical Memory Region.

2. For instance, a Region modified by Reregister Memory Region can later be
modified by Reregister Physical Memory Region.

This Verb conceptually performs the functions Deregister Memory
Region followed by Register Physical Memory Region. Where possible,
resources below the Verb layer are expected to be reused instead of
deallocated and reallocated. This Verb may be used to change the
access rights and/or protection domain of a region, as well as changing
the memory locations that are registered.

The L_Key and R_Key output modifiers from this Verb must be used
in place of any previously issued for this region.

It is an error for a Consumer to attempt to reregister a Memory Region
while the Region still has any Memory Windows bound to it. Channel
Interface implementations have options on how to deal with the error,
as described in 10.6.6.2.4 Deregistering Regions with Bound Windows
on page 413.

C11-22: For the Reregister Physical Memory Region Verb, the CI shall
conform to all of the compliance statements contained in 11.2.7.5
Reregister Memory Region.

Input Modifiers:

• HCA Handle.
• Memory Region Handle - as issued when region was registered.
• Change Request type - The following may be selected in any
  combination; the input modifiers required to support the request
  are listed below each request.
    • Change Translation.
      Input Modifiers required.
        • Physical Buffer List
            • List of Physical Buffers - Each buffer must begin and
              end on an HCA-supported page boundary.
            • Total number of Physical Buffers in the list.
        • I/O Virtual Address - IOVA requested by the Consumer for
          the first byte of the region.
        • Length of Region to be registered in bytes.
        • Offset of Region's starting IOVA within the first physical
          buffer.
    • Change Protection Domain.
      Input Modifiers required.
        • Protection Domain.
    • Change Access Control.
      Input Modifiers required.
        • Access Control Selections.

Output Modifiers:

• Memory Region Handle - must be used for future references to
  this Memory Region. Might or might not be the same as the previous
  Region Handle.
• I/O Virtual Address - IOVA actually assigned by the Channel
  Interface for the first byte of the Region.
• L_Key - used for local access.
• R_Key - used for remote access.

  The R_Key is returned only when Remote Access was requested.

• Verb Results:
    • Operation completed successfully.
    • Insufficient resources to complete request.
    • Invalid HCA handle.
    • Invalid Memory Region handle.
    • Invalid Virtual Address.
    • Invalid Length.
    • Invalid Offset.
    • Invalid Protection Domain.
    • Invalid Access Control specifier.
    • Operation denied; Region still has bound Window(s).

11.2.7.7 Register Shared Memory Region

Description:

Given an existing Memory Region, a new Memory Region associated
with the same physical memory locations is created, with the intention
that the new Memory Region share HCA mapping resources to the
extent possible. Through repeated calls to the Verb, an arbitrary number
of Memory Regions can potentially share the same HCA mapping
resources, all associated with the same physical memory locations.

The Virtual Address, Protection Domain, and Access Rights specified
for the new Memory Region need not be the same as those of the
existing Memory Region. The lengths are by definition the same.

The Consumer supplies a requested Virtual Address to be associated
with the first page in the new Memory Region, and the Channel
Interface returns the Virtual Address that is actually assigned.

Input Modifiers:

• HCA Handle.
• Memory Region Handle - of an already registered region.
• Virtual Address - requested by the Consumer for the first page of
  the buffer.
• Access Control Selections.

Output Modifiers:

• Memory Region Handle - of the new Memory Region.
• Virtual Address - actually assigned by the Channel Interface for
  the first page.
• L_Key - used for local access.
• R_Key - used for remote access.

  The R_Key is returned when Remote Access Rights are requested.

• Verb Results:
    • Operation completed successfully.
    • Insufficient resources to complete request.
    • Invalid HCA handle.
    • Invalid Memory Region handle.
    • Invalid Protection Domain.
    • Invalid Access Control specifier.

11.2.7.8 Allocate Memory Window

Description:



This Verb allocates a memory window which is associated with a
protection domain. It is not inherently associated with any memory
region when allocated.

Input Modifiers:

• HCA Handle.
• Protection Domain to be assigned to the Memory Window.

Output Modifiers:

• Window Handle - used to identify this specific Memory Window to
  other Memory Management Verbs.
• R_Key - an unbound R_Key for use in specifying the Window with
  the Bind Memory Window Verb.
• Verb Results:
    • Operation completed successfully.
    • Insufficient resources to complete request.
    • Invalid HCA handle.
    • Invalid Protection Domain.
11.2.7.9 Query Memory Window

Description:

This Verb returns the attributes associated with the specified memory
window.

Input Modifiers:

• HCA Handle.
• Window Handle - as issued by an Allocate Memory Window.

Output Modifiers:

• R_Key - the current R_Key associated with the Memory Window.
• Protection Domain associated with the Memory Window.
• Verb Results:
    • Operation completed successfully.
    • Invalid HCA handle.
    • Invalid Memory Window handle.

11.2.7.10 Bind Memory Window

Description:

Posts a Work Request to a specified Send Queue, which binds a
Memory Window to a specified VA range and remote access attributes
based on an existing Memory Region. The QP Service Type must be
either Reliable Connection, Unreliable Connection, or Reliable
Datagram.

The specified VA range must either be the entire Memory Region or a
subset of it. Remote Write Access or Remote Atomic Access must not
be specified unless the Memory Region has Local Write Access. The
QP, Memory Window, and Memory Region must belong to the same
HCA and Protection Domain.

A previously bound Memory Window can be bound to a new VA range
in the same or a different Memory Region, causing the previous
binding to be invalidated. Binding a previously bound Memory Window
to a zero-length VA range will invalidate the previous binding and
return an R_Key that is in the unbound state.

The Bind operation has a unique ordering rule: any Work Request
posted to a Send Queue subsequent to a Bind must not begin
execution until the Bind operation completes.

Under normal operation, it is improper for a Consumer to change the
binding of a Memory Window while it is being accessed by a remote
agent. However, this can occur if remote agents misbehave, or it can
occur under error recovery circumstances. Any Remote Operation
requests that are in process and actively using a Memory Window when
its binding is changed must fail with a protection violation. Once the
Bind operation has been reported to the Consumer as having
completed, the Channel Interface must guarantee that no additional
accesses can be performed under the immediate previous binding.
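
The preconditions on a bind request stated above can be collected into one check. The following is a minimal sketch and not the specification's interface: the `Region` record, the `bind_violation` function, the access-right labels, and the short reason strings it returns are all hypothetical, and the "same HCA" requirement is not modelled (only the Protection Domain is compared).

```python
from dataclasses import dataclass

BINDABLE_SERVICE_TYPES = {
    "Reliable Connection",
    "Unreliable Connection",
    "Reliable Datagram",
}


@dataclass(frozen=True)
class Region:
    start: int          # first virtual address covered by the region
    length: int         # size of the region in bytes
    local_write: bool   # whether the region has Local Write Access
    pd: str             # protection domain label


def bind_violation(service_type, qp_pd, window_pd, region, va, length, access):
    """Return None if the bind satisfies the stated rules, else a reason."""
    if service_type not in BINDABLE_SERVICE_TYPES:
        return "service type"
    if not (qp_pd == window_pd == region.pd):
        return "protection domain"
    # The bound range must be the whole region or a subset of it.
    if va < region.start or va + length > region.start + region.length:
        return "range"
    # Remote write and remote atomic both need Local Write on the region.
    needs_local_write = {"remote_write", "remote_atomic"} & set(access)
    if needs_local_write and not region.local_write:
        return "access"
    return None
```

Which Verb Result or completion status a real Channel Interface reports for each violation is left open here on purpose.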

Input Modifiers:

• HCA Handle.
• QP Handle.
• Memory Region Handle.
• L_Key - The L_Key for the Memory Region that the Memory Window
  will be associated with.
• Virtual Address - the address of the first byte of the bound range.
  The maximum size of a Virtual Address is 64 bits.
• Length of range to be bound in bytes.
• Access Control - The following may be selected in any combination
  except as noted.
    • Enable Remote Write Access.
      Requires the Memory Region to have Local Write Access.
    • Enable Remote Read Access.
    • Enable Remote Atomic Operation Access (if Atomic Ops supported).
      Requires the Memory Region to have Local Write Access.
• Completion notification type. Must be specified and is only valid if
  the Send Queue was set up for selectable signaling.

Output Modifiers:

• R_Key - The R_Key associated with the new binding, whose value
  is different from that of the supplied R_Key.
• Verb Results:
    • Operation completed successfully.
    • Insufficient resources to complete request.
    • Invalid HCA handle.
    • Invalid QP handle.
    • Invalid Service Type for this QP.
    • Invalid Memory Window handle.
    • Invalid R_Key.
    • Invalid Memory Region handle.
    • Invalid L_Key.
    • Invalid Virtual Address.
    • Invalid Length.
    • Invalid Access Control specifier.
    • Invalid completion notification type.
• Work Request Completion Status
    • Operation completed successfully.
    • Protection Error.

11.2.7.11 Deallocate Memory Window

Description:

Under normal operation, it is improper for a Consumer to deallocate a
Memory Window while it is being accessed by a remote agent.
However, this can occur if remote agents misbehave, or it can occur under
error recovery circumstances. Any Remote Operation requests that
are in process and actively using a Memory Window when it is
deallocated must fail with a protection violation. Once the deallocation Verb
completes, the Channel Interface must guarantee that no additional
accesses can be performed through that Memory Window.

Input Modifiers:

• HCA Handle.
• Window Handle - as issued by an Allocate Memory Window.

Output Modifiers:

• No output modifiers.
• Verb Results:
    • Operation completed successfully.
    • Invalid HCA handle.
    • Invalid Memory Window handle.



11.3 Multicast

11.3.1 Attach QP to Multicast Group

Description:

Attaches the QP to the specified multicast group. The only function of
this Verb is to assign the Receive Work Queue of this QP to the
specified multicast group; after the attachment completes, this QP will be
provided with a copy of every multicast message addressed to the
specified group and received on the HCA port with which the QP is
associated. Creation of the multicast group, and reconfiguration of the
fabric such that packets addressed to that group are routed to a local
HCA port, is described in 7.10 IBA and Raw Packet Multicast on page
179.

The Service Type of the specified QP must be Unreliable Datagram. It
is an error to specify a QP with any other Service Type.

One or more QPs are allowed to be attached to a multicast group on
the HCA. If the maximum number of multicast group attachments has
already been reached for the HCA when a QP attempts to attach to
the multicast group, an error is returned.

The input modifier which determines the multicast group to attach to
can be either a DLID, an IPv6 Address or both.

The IBA unreliable multicast feature is optional. This Verb is required
only if IBA unreliable multicast is supported by the HCA.

Input Modifiers:

• HCA Handle.
• Multicast group DLID.
• Multicast group IPv6 Address.
• QP Handle.

Output Modifiers:

• Verb Results:
    • Operation completed successfully.
    • Insufficient resources to complete request.
    • Invalid HCA handle.
    • Invalid multicast DLID.
    • Invalid Multicast group IPv6 Address.
    • Invalid QP handle.
    • Invalid Service Type for this QP.
    • Number of QPs attached to multicast groups exceeded.

11.3.2 Detach QP from Multicast Group

Description:

Detaches the specified QP from a multicast group. The only function
of this Verb is to detach the Receive Work Queue of this QP from the
specified multicast group.

All of the input modifiers must be correct for the QP to be detached. If
the QP is attached to a different multicast group or port, an error will
be returned.

This Verb is required only if IBA unreliable multicast is supported by
the HCA. The IBA unreliable multicast feature is optional.

Input Modifiers:

• HCA Handle.
• HCA port number.
• Multicast group DLID.
• Multicast group IPv6 Address.
• QP Handle.

Output Modifiers:

• Verb Results:
    • Operation completed successfully.
    • Invalid HCA handle.
    • Invalid HCA port number.
    • Invalid multicast DLID.
    • Invalid Multicast group IPv6 Address.
    • Invalid QP handle.
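
The attach and detach rules can be modelled as simple bookkeeping on a set of attachments. This is a minimal sketch, assuming a single per-HCA attachment limit: the class `ToyMulticastTable` and its methods are hypothetical, the HCA port number that Detach takes as an input modifier is left out for brevity, and the generic `"error"` return on a mismatched detach stands in for whichever Verb Result a real Channel Interface would report.

```python
ATTACH_OK = "Operation completed successfully"
BAD_SERVICE_TYPE = "Invalid Service Type for this QP"
LIMIT_EXCEEDED = "Number of QPs attached to multicast groups exceeded"


class ToyMulticastTable:
    """Hypothetical per-HCA record of multicast attachments."""

    def __init__(self, max_attachments):
        self.max_attachments = max_attachments
        self.attached = set()  # entries are (qp, dlid, ipv6) tuples

    def attach(self, qp, service_type, dlid=None, ipv6=None):
        # Only Unreliable Datagram QPs may be attached.
        if service_type != "Unreliable Datagram":
            return BAD_SERVICE_TYPE
        if len(self.attached) >= self.max_attachments:
            return LIMIT_EXCEEDED
        self.attached.add((qp, dlid, ipv6))
        return ATTACH_OK

    def detach(self, qp, dlid=None, ipv6=None):
        key = (qp, dlid, ipv6)
        if key not in self.attached:
            # Every input modifier must match the existing attachment.
            return "error"
        self.attached.remove(key)
        return ATTACH_OK
```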



11.4 Work Request Processing

11.4.1 Queue Pair Operations

11.4.1.1 Post Send Request

Description:

Builds a WQE for the Send Queue in the specified QP from the
information contained in the Work Request submitted by the Consumer.
This WQE is added to the end of the Send Queue and the HCA is
notified that a new WQE is ready to be processed.

If the Send Queue is enabled for selectable completion notification,
the Consumer must specify whether a successful completion of this
Work Request results in a completion entry on the CQ.

Control returns to the Consumer immediately after the WQE has been
submitted to the Send Queue and the HCA has been notified that a
new WQE is ready to process. When control returns, the Work
Request is in the scope of the Consumer and will no longer be modified
or accessed below the Channel Interface.

C11-23: The CI shall return control to the Consumer immediately after the
Work Request has been submitted to the Send Queue.

C11-24: Once control has been returned to the Consumer the CI shall not
modify or access the Work Request.

Sends, RDMA and atomic operations can all take place on the same
QP. Table 83 Operation Type Matrix shows which operations are
allowed for each Service Type of the QP.

C11-25: The CI shall support the operations based on QP Service type
according to Table 83 Operation Type Matrix.

Table 83 Operation Type Matrix

                      Send          RDMA Read     RDMA Write    Atomic Ops
Reliable Connected    Yes           Yes           Yes           Yes
Reliable Datagram     Yes           Yes           Yes           Yes
Unreliable Connected  Yes           Not allowed   Yes           Not allowed
Unreliable Datagram   Yes           Not allowed   Not allowed   Not allowed
Raw Datagram          Yes           Not allowed   Not allowed   Not allowed
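
The Operation Type Matrix reads naturally as a lookup table keyed by Service Type. The following is a minimal sketch of that reading: the dictionary and the `is_allowed` helper are hypothetical names introduced for illustration, and the entries are copied from the matrix.

```python
# Operations allowed for each QP Service Type, per the Operation Type Matrix.
ALLOWED_OPERATIONS = {
    "Reliable Connected": {"Send", "RDMA Read", "RDMA Write", "Atomic Ops"},
    "Reliable Datagram": {"Send", "RDMA Read", "RDMA Write", "Atomic Ops"},
    "Unreliable Connected": {"Send", "RDMA Write"},
    "Unreliable Datagram": {"Send"},
    "Raw Datagram": {"Send"},
}


def is_allowed(service_type, operation):
    """Return True if `operation` may be posted on a QP of `service_type`."""
    return operation in ALLOWED_OPERATIONS[service_type]
```

For example, `is_allowed("Unreliable Connected", "RDMA Write")` is true while `is_allowed("Unreliable Connected", "RDMA Read")` is false.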



The ordering and fencing considerations for Atomic Operations are
the same as for RDMA Read.

Not all of the Input Modifiers are valid for all operations. Table 84:
Work Request Modifier Matrix shows which of the Input Modifiers are
valid for each operation. If Input Modifiers are specified that are not
valid for a particular operation, they are ignored.

C11-26: The CI shall ignore all input modifiers in a Work Request that are
not valid for the specified operation as shown in Table 84: Work Request
Modifier Matrix.



Table 84: Work Request Modifier Matrix

Modifier                  Send                    RDMA Read               RDMA Write              Atomic Ops

Work Request ID           Required                Required                Required                Required

Completion notification   Required if Send        Required if Send        Required if Send        Required if Send
indicator                 Queue selectable        Queue selectable        Queue selectable        Queue selectable
                          signaled                signaled                signaled                signaled

Scatter/Gather list       Required (a)            Required (a)            Required (a)            N/A

# of Data Segments        Required (a)            Required (a)            Required (a)            N/A

Immediate Data            Optional except N/A     N/A                     Optional                N/A
                          for Raw Datagram QPs

Fence Indicator           Optional for            Optional for            Optional for            Optional for
                          Reliable QPs            Reliable QPs            Reliable QPs            Reliable QPs

Remote Node Address       Address Handle          N/A                     N/A                     N/A
                          Required for UD QPs,
                          DLID & SL Required
                          for Raw

Remote Node QP # and      Required for IB         Required for Reliable   Required for Reliable   Required for Reliable
Q_Key                     Datagram QPs            Datagram QPs            Datagram QPs            Datagram QPs

EE Context                Required for Reliable   Required for Reliable   Required for Reliable   Required for Reliable
                          Datagram QPs            Datagram QP             Datagram QP             Datagram QP

Remote address            N/A                     Required                Required                Required
and R_Key

Atomic operands           N/A                     N/A                     N/A                     Required

Solicited Event           Optional                N/A                     Optional with           N/A
                                                                          Immediate Data

Ethertype                 Required for Raw        N/A                     N/A                     N/A
                          Ethertype QPs

a. Scatter/Gather list is allowed to have zero elements. In the case of a Reliable Datagram
Service Type queue, the maximum number of elements is two.

Note: If the Service Type is not mentioned in a field in the above table, the modifier is not
applicable for that Service Type.
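
The rule that invalid input modifiers are ignored rather than rejected can be sketched as a filter over the Work Request. This is an illustrative sketch only: the snake_case modifier names and the `effective_modifiers` helper are hypothetical, and the table below encodes just a few rows of the matrix whose entries do not depend on the QP Service Type.

```python
# Operations for which each modifier is valid (partial, from the matrix).
VALID_FOR = {
    "work_request_id": {"Send", "RDMA Read", "RDMA Write", "Atomic Ops"},
    "scatter_gather_list": {"Send", "RDMA Read", "RDMA Write"},
    "remote_address_and_r_key": {"RDMA Read", "RDMA Write", "Atomic Ops"},
    "atomic_operands": {"Atomic Ops"},
}


def effective_modifiers(operation, modifiers):
    """Return only the modifiers that apply to `operation`.

    Modifiers that are not valid for the operation are silently dropped,
    mirroring the "ignored" behaviour. Names missing from the partial
    table above are passed through unchanged.
    """
    kept = {}
    for name, value in modifiers.items():
        valid = VALID_FOR.get(name)
        if valid is None or operation in valid:
            kept[name] = value
    return kept
```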

4 

Input Modifiers:

This is the full list of modifiers for all of the operations available on the
Send Queue. Not all modifiers can be used for all queue or operation
types. See Table 83 Operation Type Matrix and Table 84: Work
Request Modifier Matrix for details on which modifiers may be used for
the specified queue and operation types.

• HCA handle.
• QP handle.
• The Work Request containing the information required to perform
  the request. The modifiers that must be specified are dependent
  on the operation type specified. The Work Request is defined as
  follows:
    • A user defined 64-bit Work Request ID.
    • Operation type. Valid operation types for Work Requests
      submitted to the Send Queue are:
        • Send
        • RDMA Read
        • RDMA Write
        • Compare & Swap (assuming the HCA supports atomic
          operations)
        • Fetch & Add (assuming the HCA supports atomic
          operations)
    • Completion notification type. Must be specified and is only
      valid if the Send Queue was set up for selectable signaling.
    • Scatter/Gather list. The scatter/gather list can contain zero or
      more Data Segments. The list is specified only for Send and
      RDMA operations.
    • Number of Data Segments in the scatter/gather list. This
      modifier is used only when the scatter/gather list must be
      specified.
    • 4-byte Immediate Data. Valid only for Send or Write RDMA
      operations.
    • Fence indicator. If the fence indicator is set, then all prior
      RDMA Read and Atomic Work Requests on the queue must
      be completed before starting to process this Work Request.
      The Fence indicator only has an effect with the Reliable
      Connection and Reliable Datagram transport services.
    • Remote node address, required only for operations on Raw or
      IB Datagram Service Types.
    • QP number of the destination QP. Required only for
      operations on IB Datagram Service Types.
    • The Q_Key for the destination QP. Required only for
      operations on IB Datagram Service Types. See 10.2.4 Q Keys on
      page 376 for more detail on how the CI determines which
      Q_Key to insert in the packet.
    • Ethertype associated with the Work Request. Required only
      for Raw Ethertype QPs.
    • EE Context. Required only for Reliable Datagram QPs. Note
      that this is the EE Context number and not the EE Context
      Handle.
    • Solicited Event Indicator. Valid only for Sends or RDMA
      Writes with immediate data.
    • Remote address specified by an address and R_Key.
      Required and used only for RDMA and atomic operations. For
      Atomic operations, the address must point to a location that is
      64-bit aligned.
    • Atomic operation operands. If an atomic operation is
      specified, the following additional operands must be supplied:
        • 1st 64-bit operand. Must be aligned on a 64-bit boundary.
          It is the value to compare against for Compare & Swap.
          The value to add for Fetch & Add.
        • 2nd 64-bit operand. Must be aligned on a 64-bit boundary.
          This value replaces the previous contents of the remote
          address if the first argument equals the content of the
          location in the Compare & Swap. Ignored for Fetch & Add.
        • A local Data Segment where a copy of the original
          contents of the remote memory operation will be
          deposited after the atomic operation completed at the
          remote endnode.
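
The roles of the two atomic operands can be illustrated with a toy model of remote memory. This is a minimal sketch under stated assumptions: remote memory is modelled as a Python dict from byte address to a 64-bit value, the function names are hypothetical, and wrap-around of the Fetch & Add result modulo 2^64 is an assumption of this sketch rather than something stated in the text above. In both cases the returned value is the original contents, which is what would be deposited in the local Data Segment.

```python
MASK64 = (1 << 64) - 1


def _check_aligned(address):
    # Atomic operations require a 64-bit (8-byte) aligned remote address.
    if address % 8 != 0:
        raise ValueError("remote address must be 64-bit aligned")


def compare_and_swap(memory, address, compare, swap):
    """Replace memory[address] with `swap` if it equals `compare`."""
    _check_aligned(address)
    original = memory[address]
    if original == compare:
        memory[address] = swap & MASK64
    return original


def fetch_and_add(memory, address, add):
    """Add `add` to memory[address]; the second operand is not used."""
    _check_aligned(address)
    original = memory[address]
    memory[address] = (original + add) & MASK64  # assumed 64-bit wrap-around
    return original
```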

Output Modifiers:

• Verb Results:
    • Operation completed successfully.
    • Invalid HCA handle.
    • Invalid QP handle.
    • Too many Work Requests posted.
    • Invalid operation type.
    • Invalid QP state.

      Note: This error is returned only when the QP is in the Reset,
      Init, or RTR states. It is not returned when the QP is in the Error
      state due to race conditions that could result in indeterminate
      behavior. Work Requests posted to the Send Queue while the
      QP is in the Error state are completed with a flush error.

    • Invalid completion notification type.
    • Invalid Scatter/Gather list format.
    • Invalid Scatter/Gather list length.
    • Atomic operations not supported.
    • Invalid address handle.

11.4.1.2 Post Receive Request

Description:

Builds a WQE for the Receive Queue in the specified QP from the
information contained in the Work Request submitted by the Consumer.
This WQE is added to the end of the Receive Queue and the HCA is
notified that a new WQE is ready to be processed.

Control returns to the Consumer immediately after the WQE has been
submitted to the Receive Queue and the HCA has been notified that
a new WQE is ready to process. When control returns, the Work
Request is in the scope of the Consumer and will no longer be modified
or accessed below the Channel Interface.

C11-27: The CI shall return control to the Consumer immediately after the
Work Request has been submitted to the Receive Queue.

Input Modifiers:

• HCA handle.
• QP handle.
• The Work Request containing the information required to perform
  the request. The modifiers that must be specified are dependent
  on the operation type specified. The Work Request is defined as
  follows:
    • A user defined 64-bit Work Request ID.
    • Operation type. The only valid operation for the Receive
      Queue is the Receive operation.

    • Scatter/Gather list. This list can contain zero or more Data
      Segments.

      Note that for Raw IPv6 and UD QPs, the first 40 bytes of the
      buffer(s) referred to by the Scatter/Gather list will contain the
      GRH of the incoming message. If no GRH is present, the
      contents of the first 40 bytes of the buffer(s) will be undefined. The
      presence of the GRH will be indicated by a bit in the Work
      Completion.

    • Number of Data Segments in the scatter/gather list.

Output Modifiers:

• Verb Results:
    • Operation completed successfully.
    • Invalid HCA handle.
    • Invalid QP handle.
    • Too many Work Requests posted.
    • Invalid operation type.
    • Invalid QP state.
    • Invalid Scatter/Gather list format.
    • Invalid Scatter/Gather list length.

11.4.2 Completion Queue Operations

11.4.2.1 Poll for Completion

Description:

Polls the specified CQ for a Work Completion. A Work Completion
indicates that a Work Request for a Work Queue associated with the CQ
is done.

If an entry is present, the Work Completion at the head of the CQ is
returned to the Consumer.
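
Polling returns the Work Completion at the head of the CQ when one is present, which is first-in, first-out queue behaviour. The following is a minimal sketch of that behaviour: `ToyCompletionQueue` and its methods are hypothetical names, and returning `None` for an empty CQ is an assumption of this sketch, since the result reported for an empty CQ is not given in the text above.

```python
from collections import deque


class ToyCompletionQueue:
    """Hypothetical FIFO model of a Completion Queue."""

    def __init__(self):
        self._entries = deque()

    def add_completion(self, work_completion):
        # Called when a Work Request on an associated Work Queue is done.
        self._entries.append(work_completion)

    def poll(self):
        """Return the Work Completion at the head of the CQ, or None."""
        if self._entries:
            return self._entries.popleft()
        return None  # assumed representation of "no entry present"
```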



The following table defines, classifies and associates wire level pro- 
tocol NAK codes with completion errors that are possible on Work Re 
quests posted to the Send Queue. Completion errors are returned 
through the completion queue as work completions. 34 

35 
36 
37 
38 
39 
40 
41 
42 
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C11-28: The CI shall return completion errors for a Work Request in the 
associated Work Completion for errors described in Table 85 Completion 
Error Types for Send Queues . 

Table 85 Completion Error Types for Send Queues

Error Type                              Completion  Transport Errors                Transport Errors
                                        Type        returned by responder (RC)      sent by responder (RD)

Local Length                            Interface   N/A                             N/A
Local Operation                         Interface   N/A                             N/A
Local Operation                         Processing  N/A                             Optional NAK - Invalid Request
Local Protection                        Interface   N/A                             N/A
Local Protection                        Processing  N/A                             Optional NAK - Invalid Request
Work Request Flushed                    Processing  N/A                             N/A
Memory Window Bind                      Interface   N/A                             N/A
Remote Access                           Processing  NAK - Remote Access Violation   NAK - Remote Access Violation
Remote Operation                        Processing  NAK - Remote Operational Error  NAK - Remote Operational Error
Remote Invalid Request                  Processing  NAK - Invalid Request           NAK - Invalid Request
Remote Invalid RD Request               Processing  N/A                             NAK - Invalid RD Request
RNR NAK Counter Exceeded                Processing  NAK - RNR                       NAK - RNR
Transport Timeout Retry Count Exceeded  Processing  N/A                             N/A






A Remote Q_Key violation and a Remote RDD Mismatch will both result
in an Invalid RD Request completion error type for the requester's WQE.
Since the same NAK code is returned in both cases, it is not possible for
the requester to distinguish between them.
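
The association in Table 85 can be read as a lookup from a NAK code to a completion error type. The Python sketch below is illustrative only: the names `TABLE_85_NAK_ROWS`, `completion_errors_for_nak` and `requester_view` are hypothetical, the rows marked "Optional NAK" are left out for brevity, and the two responder-side causes come from the paragraph above.

```python
# Rows of Table 85 that carry a NAK code on the wire.
# Each row: (completion error type, NAK for RC, NAK for RD); None means N/A.
TABLE_85_NAK_ROWS = [
    ("Remote Access", "Remote Access Violation", "Remote Access Violation"),
    ("Remote Operation", "Remote Operational Error", "Remote Operational Error"),
    ("Remote Invalid Request", "Invalid Request", "Invalid Request"),
    ("Remote Invalid RD Request", None, "Invalid RD Request"),
    ("RNR NAK Counter Exceeded", "RNR", "RNR"),
]

# Hypothetical responder-side causes; both are answered with the same NAK code.
RESPONDER_CAUSE_TO_NAK = {
    "Remote Q_Key violation": "Invalid RD Request",
    "Remote RDD mismatch": "Invalid RD Request",
}


def completion_errors_for_nak(service, nak_code):
    """Return the completion error types associated with a NAK code."""
    column = {"RC": 1, "RD": 2}[service]
    return [row[0] for row in TABLE_85_NAK_ROWS if row[column] == nak_code]


def requester_view(cause):
    """Return what the requester can observe for a responder-side cause."""
    nak_code = RESPONDER_CAUSE_TO_NAK[cause]
    return completion_errors_for_nak("RD", nak_code)
```

Because both causes collapse onto one NAK code, `requester_view` returns the same completion error type for each, which is why the requester cannot tell them apart.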

The following table defines, classifies and associates wire level protocol 
NAK codes with completion errors that are possible on Work Requests 
posted to the Receive Queue. Completion errors are returned through the 
completion queue as work completions. 



InfiniBand™ Trade Association

Page 503

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067

InfiniBand™ Architecture Release 1.0 Software Transport Verbs October 24, 2000

VOLUME 1 - General Specifications FINAL

C11-29: The CI shall generate the completion errors based on the NAK
codes as shown in Table 86 Completion Error Types for Receive Queues.



Table 86 Completion Error Types for Receive Queues

Error Type        Completion Type  Transport Errors sent to Requester (RC and RD)

Local Length      Processing       NAK - Remote Operational Error
Local Protection  Processing       NAK - Remote Operational Error
Local Operation   Processing       NAK - Remote Operational Error

Input Modifiers:

• HCA handle.

• CQ handle.

Output Modifiers:



The Work Completion containing information relating to the completed
Work Request if an entry is present on the CQ. If the status of the
operation that generates the Work Completion is anything other than
success, the contents of the Work Completion are undefined except as
noted below. The contents of a Work Completion are:

• The 64-bit Work Request ID set by the Consumer in the associated
  Work Request. This is always valid, regardless of the status of the
  operation.

• The operation type specified in the completed Work Request.

• The valid operation types are:

    • Send (for WRs posted to the Send Queue)

    • RDMA Write (for WRs posted to the Send Queue)

    • RDMA Read (for WRs posted to the Send Queue)

    • Compare and Swap (for WRs posted to the Send Queue)

    • Fetch and Add (for WRs posted to the Send Queue)

    • Memory Window Bind (for WRs posted to the Send Queue)

    • Send Data Received (for WRs posted to the Receive Queue)

    • RDMA with Immediate Data Received (for WRs posted to the
      Receive Queue)

• The number of bytes transferred.

  The number of bytes transferred is returned in Work Completions for
  Receive Work Requests for incoming Sends and RDMA Writes with
  Immediate Data. This does not include the length of any immediate
  data.

  The number of bytes transferred is returned in Work Completions for
  Send Work Requests for RDMA Read and Atomic Operations.

  In the case of Raw IPv6 and UD QPs, the number of bytes transferred
  is the payload of the message plus the 40 bytes reserved for the GRH.
  The 40 bytes is always included, whether or not the GRH is present.

• Immediate data indicator. This is set if immediate data is present.

• 4-byte immediate data.

• Remote node address and QP. Returned only for Datagram services. The
  address information returned for incoming Datagrams is shown in
  Table 87 Datagram addressing information.

• GRH Present indicator, for Raw IPv6 and UD QPs only. If this
  indicator is set, the first 40 bytes of the buffer(s) referred to by
  the Scatter/Gather list will contain the GRH of the incoming message.
  If it is not set, the contents of the first 40 bytes of the buffer(s)
  will be undefined. Contents of the payload of the message will begin
  after the first 40 bytes.

Table 87 Datagram addressing information

Reliable Datagrams      Unreliable Datagrams  Raw IPv6        Raw Ethertype

16-bit SLID             16-bit SLID           16-bit SLID     16-bit SLID
4-bit SL                4-bit SL              4-bit SL        4-bit SL
24-bit Source QP        24-bit Source QP                      16-bit Ethertype
24-bit local EE Number  DLID Path Bits        DLID Path Bits  DLID Path Bits
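
The GRH rules for Raw IPv6 and UD QPs (a 40-byte area that is always counted, and a payload that always starts after it) can be sketched as a small helper. This is a minimal Python illustration under stated assumptions: the function name `split_datagram` and the constant `GRH_BYTES` are hypothetical, and the receive buffer is assumed to be one contiguous region rather than a multi-element Scatter/Gather list.

```python
GRH_BYTES = 40  # area reserved for the GRH at the start of the buffer


def split_datagram(buffer, bytes_transferred, grh_present):
    """Return (grh, payload) for a received UD or Raw IPv6 message.

    bytes_transferred is the value reported in the Work Completion, which
    always includes the 40-byte GRH area. grh is None when the GRH Present
    indicator is clear, because the first 40 bytes are then undefined.
    """
    payload_len = bytes_transferred - GRH_BYTES
    if payload_len < 0:
        raise ValueError("bytes transferred is smaller than the GRH area")
    grh = bytes(buffer[:GRH_BYTES]) if grh_present else None
    payload = bytes(buffer[GRH_BYTES:GRH_BYTES + payload_len])
    return grh, payload
```

A Consumer following this pattern never reads the first 40 bytes unless the indicator is set, and computes the payload length by subtracting 40 in both cases.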






• P_Key index, for GSI only.

• Status of the operation. This is always valid.

  These status codes are covered in 11.6.2 Completion Return Status,
  with NAK codes reported according to Table 85 Completion Error Types
  for Send Queues and Table 86 Completion Error Types for Receive
  Queues.

• Freed Resource Count (see 10.8.5.1 Freed Resource Count on page 425).
  This is always valid, regardless of the status of the operation.

• Verb Results:

    • Operation completed successfully.

    • Invalid HCA handle.

    • Invalid CQ handle.

    • CQ empty.

11.4.2.2 Request Completion Notification

Description:

Requests the CQ event handler be called when the next completion entry
of the specified type is added to the specified CQ. The handler is
called at most once per Request Completion Notification call for a
particular CQ. Any CQ entries that existed before the notify is enabled
will not result in a call to the handler.



Completion Events are one of two types: solicited or unsolicited. A
Solicited Completion Event occurs when an incoming Send or RDMA Write
with Immediate Data message with the Solicited Event header bit set
causes a successful Receive Work Completion to be added to a CQ, or
when an unsuccessful Work Completion is added to a CQ. An Unsolicited
Completion Event occurs when any other successful Receive Work
Completion, or any successful Send Work Completion, is added to a CQ.

C11-30: The CI shall support both solicited and unsolicited Completion
Event Types.

When the Consumer requests completion notification, it must specify
whether the notification callback is invoked for either:

• the next Solicited Completion Event only, or

• the next Solicited or Unsolicited Completion Event.

If a Request Completion Notification is pending, subsequent calls to
Request Completion Notification for the same CQ prior to the completion
event affect only when the notification occurs. A Request Completion
Notification for the next completion event takes precedence over a
Request Completion Notification for a solicited event completion for
the same CQ.

If multiple calls to Request Completion Notification have been made
for the same CQ and at least one of the requests set the type to the
next completion, the CQ event handler will be called when the next
completion is added to that CQ. The CQ event handler will be called
only once, even though multiple CQ notification requests were made
prior to the completion event for the specified CQ.

Once the CQ event handler is called, another completion notification
request must be registered before the CQ event handler will be called
again.

C11-31: When a completion notification request is outstanding on a CQ
for a solicited completion type and another request for that CQ is made
that specifies a notification for the next completion, the CI shall
change the outstanding completion notification type to the next
completion.

C11-32: When a completion notification request is outstanding on a CQ
for the next completion and another notification request for that CQ is
made, the CI shall not change the outstanding completion notification
type.

A CQ event handler must be specified prior to calling this routine. If
the CQ event handler has not been registered when the event is
generated, the handler call will not be made.

When the CQ event handler is called, it only indicates a new entry was
added to the specified CQ. The HCA and CQ handles are passed to the CQ
event handler so the CQ event handler can determine which CQ caused it
to be called.

Once the handler routine has been invoked, the Consumer must call
Request Completion Notification again to be notified when a new entry
is added to that CQ.

It is the responsibility of the Consumer to call the Poll for
Completion Verb to retrieve a Work Completion.

Input Modifiers:

• HCA handle.

• CQ handle.

• Type of completion notification requested. The type is either the
  next completion or when a solicited completion occurs.

Output Modifiers:

• Verb Results:

    • Operation completed successfully.

    • Invalid HCA handle.

    • Invalid CQ handle.

    • Invalid completion notification type.
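
The one-shot notification behaviour described for Request Completion Notification, including the precedence of "next completion" over "solicited", can be sketched as follows. This is a simplified Python model rather than the Verbs interface: the class `CompletionQueue` and its method names are hypothetical, a handler is assumed to be registered before any request is made, and a Work Completion is reduced to a `(work request id, success)` pair.

```python
from collections import deque


class CompletionQueue:
    """Toy model of CQ notification arming; not a real Verbs binding."""

    def __init__(self, handler):
        self.handler = handler    # CQ event handler, assumed registered
        self.entries = deque()    # pending Work Completions
        self.armed = None         # None, "solicited" or "next"

    def request_notification(self, kind):
        if kind not in ("solicited", "next"):
            raise ValueError("Invalid completion notification type")
        # An outstanding "next" request is never downgraded; an
        # outstanding "solicited" request is upgraded by a "next" request.
        if self.armed != "next":
            self.armed = kind

    def add_completion(self, wr_id, success=True, solicited=False):
        self.entries.append((wr_id, success))
        # Unsuccessful completions count as Solicited Completion Events.
        solicited_event = solicited or not success
        if self.armed == "next" or (self.armed == "solicited" and solicited_event):
            self.armed = None     # one call per arming; must be re-armed
            self.handler(self)

    def poll(self):
        """Return the oldest Work Completion, or None if the CQ is empty."""
        return self.entries.popleft() if self.entries else None
```

In this model the handler only learns that an entry was added; retrieving the Work Completion still requires a call to `poll`, and a further `request_notification` call is needed before the handler can fire again.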

11.5 Event Handling

11.5.1 Set Completion Event Handler

Description:

Registers a CQ event handler. Only one CQ event handler can be
registered per HCA. Additional calls to this Verb will overwrite the
handler routine to be called. Additional calls will not generate an
additional handler routine.

This call does not automatically request a notification on a completion
event. The Request Completion Notification Verb must be called in order
to request notification.

The parameters passed to the CQ event handler are:

• HCA handle.

• CQ handle.

Input Modifiers:

• HCA handle.

• Handler address to call.

Output Modifiers:

• Verb Results:

    • Operation completed successfully.

    • Invalid HCA handle.

11.5.2 Set Asynchronous Event Handler

Description:

Registers the asynchronous event handler. Only one asynchronous event
handler can be registered per HCA. Additional calls to this Verb will
overwrite the handler routine to be called. Additional calls will not
generate an additional handler routine.

C11-33: The CI shall use the asynchronous event handler specified in
this Verb even in the case where an existing asynchronous event handler
has already been registered.

After the asynchronous event handler is registered, all subsequent
asynchronous events will result in a call to the handler. Until an
asynchronous event handler is registered, asynchronous events will be
lost.

The parameters passed to the asynchronous event handler are:

• HCA handle.

• Event record. This contains information which indicates the resource
  type and identifier as well as which event occurred. See 11.6.3
  Asynchronous Events for more information.

Input Modifiers:

• HCA handle.

• Handler address.

Output Modifiers:

• Verb Results:

    • Operation completed successfully.

    • Invalid HCA handle.
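
The registration rules for the asynchronous event handler (a single handler per HCA, later registrations replace earlier ones, and events raised before any registration are lost) can be sketched in a few lines. This is a Python illustration only: `ChannelInterface`, its methods and the `lost_events` counter are hypothetical names introduced for the sketch, and the counter exists purely to make the "lost" case observable.

```python
class ChannelInterface:
    """Toy model of per-HCA asynchronous event handler registration."""

    def __init__(self):
        self._async_handler = None
        self.lost_events = 0  # bookkeeping for this sketch only

    def set_async_event_handler(self, handler):
        # A later registration overwrites the earlier one; no second
        # handler is ever kept alongside it.
        self._async_handler = handler

    def raise_async_event(self, hca_handle, event_record):
        if self._async_handler is None:
            self.lost_events += 1  # no handler registered: event is lost
            return
        self._async_handler(hca_handle, event_record)
```

Registering the handler before any QP or EE Context is put to use avoids the window in which events would be dropped.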

11.6 Result Types

11.6.1 Immediate Return Results

This section contains a list of the possible Verb return results. All
results except "Operation completed successfully" are due to interface
errors in the Immediate Error category. Not all Verbs return all
results.

Successful return result:

• Operation completed successfully.

Resource errors:

• Insufficient resources to complete request.

• Number of CQ entries requested exceeds HCA capability.

• Maximum number of Work Requests requested exceeds HCA capability.

• Maximum number of scatter/gather elements requested exceeds HCA
  capability.

• Too many Work Requests posted.

• Number of available Raw Datagram QPs exceeded.

• Number of QPs attached to multicast groups exceeded.

• HCA already in use.

HCA attribute errors:

• Invalid HCA name.

• Invalid HCA handle.

• MTU of HCA port exceeded.

• Invalid Port.

• Invalid Counter specified.

• Invalid Counter value.

Address errors:

• Invalid Address handle.

QP errors:

• Invalid QP handle.

• Cannot change QP attribute.

• Invalid QP state.

• Invalid Service Type for this QP.

• QP is already in use.

• Atomic operations not supported.

• Raw Datagrams not supported.

• Reliable Datagrams not supported.

• Invalid operation type.

• Invalid Scatter/Gather list format.

• Invalid Scatter/Gather list length.

• Invalid path migration state.

• Invalid Special QP type.

• Invalid Address Handle.

• More outstanding entries on WQ than size specified.

CQ errors:

• Invalid CQ handle.

• More outstanding entries on the CQ than size specified.

• One or more Work Queues still associated with the CQ.

• CQ empty.

• Invalid completion notification type.

EE Context errors:

• Invalid EE Context handle.

• Invalid EE Context state.

• Cannot change EE Context attribute.

QP or EE Context errors:

• Invalid path migration state.

• Reliable Datagram Domain is in use.

• Invalid Reliable Datagram Domain.

• Invalid RNR NAK Timer Field value.

Memory operation errors:

• Invalid Protection Domain.

• Protection Domain is in use.

• Invalid Virtual Address.

• Invalid Length.

• Invalid Physical Buffer List entry.

• Invalid Offset.

• Invalid L_Key.

• Invalid R_Key.

• Invalid Physical Buffer List entry.

• Invalid Memory Region handle.

• Invalid Memory Window handle.

• Invalid Access Control specifier.

• Operation denied; Region still has bound Window(s).

Multicast errors:

• Invalid multicast DLID.

• Invalid Multicast group IPv6 Address.

Partition table errors:

• P_Key index out of range.

• P_Key index specifies invalid entry in the P_Key table.
11.6.2 Completion Return Status

Describes the possible Work Completion status error return results.

These are errors that occur during the processing of a Work Request and
can be reported in the Work Completion status.

• Success - Operation completed successfully.

• Local Length Error - Generated for a Work Request posted to the local
  Send Queue when the sum of the Data Segment lengths exceeds the
  message length for the channel. Generated for a Work Request posted
  to the local Receive Queue when the sum of the Data Segment lengths
  is too small to receive a valid incoming message.

• Local Operation Error - An internal consistency error was detected
  while processing this Work Request.

• Local Protection Error - The locally posted Work Request's Data
  Segment does not reference a Memory Region that is valid for the
  requested operation.

• Work Request Flushed Error - A Work Request was in process or
  outstanding when the QP transitioned into the Error State.

• Memory Window Bind Error - The Verbs Consumer had insufficient access
  rights.

The following errors are reported only for Reliable QPs.

• Remote Invalid Request Error - The responder detected an invalid
  message on the channel. Possible causes include the operation is not
  supported by this receive queue, insufficient buffering to receive a
  new RDMA or Atomic Operation request, or the length specified in an
  RDMA request is greater than 2^31 bytes.

  In the case where the buffer size is insufficient to handle the
  request, the number of bytes transferred into the buffer is
  indeterminate. However, the CI shall not write beyond the buffer
  bounds.

• Remote Access Error - A protection error occurred on a remote data
  buffer to be read by an RDMA Read, written by an RDMA Write or
  accessed by an atomic operation. This error is reported only on RDMA
  operations or atomic operations.

• Remote Operation Error - The operation could not be completed
  successfully by the responder. Possible causes include a responder QP
  related error that prevented the responder from completing the
  request or a malformed WQE on the Receive Queue.

• Transport Retry Counter Exceeded - The local transport timeout retry
  counter was exceeded while trying to send this message.

• RNR Retry Counter Exceeded - The RNR NAK retry count was exceeded.

The following errors are reported only for RD QPs or EE Contexts.

• Remote Invalid RD Request - The responder detected an invalid
  incoming RD message. Causes include a Q_Key or RDD violation.

• Invalid EE Context Number - An invalid EE Context number was
  detected.

• Invalid EE Context State - Operation is not legal for the specified
  EE Context state.

11.6.3 Asynchronous Events

This section describes the asynchronous events. Asynchronous events
are separated into three categories: Affiliated asynchronous events,
Affiliated asynchronous errors and Unaffiliated asynchronous errors.
Both kinds of asynchronous errors are defined in 10.10.2.3 Asynchronous
Errors on page 435.

Affiliated asynchronous events have been separated into two categories
because the behavior of the QP/EE Context when the events occur is
different.

C11-34: When an affiliated asynchronous error occurs, the CI shall cause
the QP/EE to transition to the Error state.

C11-35: When an affiliated asynchronous event occurs, the CI must
leave the QP/EE in the QP/EE State that it was in when the asynchronous
event occurred.
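
The difference between the two affiliated categories comes down to what happens to the QP/EE state. A minimal Python sketch, with the hypothetical names `AsyncKind` and `state_after` and with states represented as plain strings:

```python
from enum import Enum


class AsyncKind(Enum):
    AFFILIATED_EVENT = "affiliated asynchronous event"
    AFFILIATED_ERROR = "affiliated asynchronous error"


def state_after(current_state, kind):
    """Return the QP/EE state after an affiliated asynchronous notification."""
    if kind is AsyncKind.AFFILIATED_ERROR:
        return "Error"        # errors force a transition to the Error state
    return current_state      # events are advisory; the state is unchanged
```

For example, a QP in the RTR state stays in RTR after an affiliated event and moves to Error after an affiliated error.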

Unaffiliated asynchronous errors are those which cannot be associated
with a specific QP or EE Context.

The Verbs Consumer must register a handler as described in 11.5.2 Set
Asynchronous Event Handler to be notified that an asynchronous event
has occurred. This mechanism is used to collect information about both
events and the errors.

11.6.3.1 Affiliated Asynchronous Events

Affiliated asynchronous events are advisories to the Verb Consumer that
the specified event has occurred on the specified QP or EE Context.
Events in this category are not considered to be errors by the Channel
Interface, so the QP/EE state remains unchanged.

• Path Migrated - Indicates the connection has migrated to the
  alternate path.

• Communication Established - Indicates the first packet has arrived
  for the Receive Work Queue where the QP/EE is still in the RTR state.
  The handle of the QP/EE, which was the destination of this packet, is
  returned in the event record. This event may be used by the
  Communication Manager as shown in the state diagram in 12.9.6
  Communication Establishment - Passive on page 546 and described in CM
  12.9.7.2 Passive States.

C11-36: The CI shall generate a Communication Established asynchronous
event when the first packet arrives for the Receive Work Queue when the
QP/EE is still in the RTR state.

• Send Queue Drained - Indicates that the Send Queue of the specified
  Queue Pair has completed the outstanding messages in progress when
  the state change was requested and, if applicable, has received all
  acknowledgements for those messages.

11.6.3.2 Affiliated Asynchronous Errors

• CQ Error - Indicates an error occurred when writing an entry to the
  Completion Queue.

C11-37: The CI shall generate a CQ Error when an error, other than CQ
overrun, occurs while writing an entry to the CQ.

C11-38: The CI shall generate a CQ Error when a CQ overrun is detected.

This condition will result in an Affiliated Asynchronous Error for any
associated Work Queues when they attempt to use that CQ. Completions
can no longer be added to the CQ. It is not guaranteed that completions
present in the CQ at the time the error occurred can be retrieved.
Possible causes include a CQ overrun or a CQ protection error.

• Local Work Queue Catastrophic Error - An error occurred while
  accessing or processing the Work Queue that prevents reporting of
  completions.

C11-39: The CI shall generate a Local Work Queue Catastrophic Error
when a Work Queue associated with a CQ that caused the CQ Error to be
generated attempts to use that CQ.

C11-40: The CI shall generate a Local Work Queue Catastrophic Error
when an error occurred while accessing or processing the Work Queue
that prevents reporting of completions.

• Path Migration Request Error - Indicates the incoming path migration
  request to this QP/EE was not accepted. The validation process is
  defined in section 17.2.8.1.2 Migration Request.

o11-6: If the CI supports automatic path migration, the CI shall generate
a Path Migration Request Error when the incoming path migration request
to this QP/EE was not accepted.

11.6.3.3 Unaffiliated Asynchronous Errors

• Local Catastrophic Error - An error occurred which cannot be
  attributable to any resource and CI behavior is indeterminate.

C11-41: The CI shall generate a Local Catastrophic Error when an error
occurred which cannot be attributable to any resource and CI behavior is
indeterminate.

• Port Error - Issued when the link is declared unavailable.

C11-42: The CI shall generate a Port Error when the link is declared
unavailable.

Chapter 12: Communication Management



12.1 Overview

Figure 119 Communication Management Entities

Communication Management encompasses the protocols and mechanisms
used to establish, maintain, and release channels for the IB Reliable
Connection, Unreliable Connection, and Reliable Datagram transport
service types. The Service ID Resolution Protocol (see section 12.11)
enables users of Unreliable Datagram service to locate Queue Pairs
supporting their desired service.

Connections are managed over Queue Pairs other than those used for the
connection, through the protocol described herein, between the
Communication Managers (CMs) on each system. (See Figure 119.) The CMs
communicate using Management Datagrams (MADs), typically over the
General Services Interface (GSI) on each system. This document defines
CM external behaviors, but internal interfaces and implementations are
outside the scope of the InfiniBand™ Architecture specification.
Examples are intended to enable understanding, not to specify
implementation.

At creation, QPs and EECs are not ready for communication. The
attributes of the QP/EEC must be modified (see sections 11.2.3.2 and
11.2.6.2) to support the desired communication characteristics and
target(s).






Due to their nature, raw packet QPs do not need, and are not supported 
by, IB communication management. 

The requirements on participating CMs are not equal. The initiating CM 
is responsible for collecting or calculating most of the information neces- 
sary to establish the connection. Much of the raw information is available 
from Subnet Administration, but some adjustments may be desirable, de- 
pending on the application of the channel. 

CMs must maintain a certain amount of information for the lifetime of a 
connection. Details may be found in section 12.9.9 . 

12.2 Establishment

[Figure 120 Sample Connection Establishment Sequence: the Client (A)
sends REQ (Request) to the Server (B), B answers with REP (Reply:
Request Accepted), and A answers with RTU (Ready To Use). The figure
is annotated "A may send" and "B may send".]



Two models are supported by the Connection Establishment protocol: Ac- 
tive/Passive (Client/Server), and Active/Active (Peer to Peer). 

As seen in Figure 119, the CMs on each system establish connections on 
behalf of their clients. The interactions between CMs and their clients are 
outside the scope of this specification. 

In the Active/Passive model (shown in Figure 120), B's CM waits for
connection requests on behalf of a server (e.g., a server process or an
I/O controller) that waits (passive) for connections to be established
by clients. The CM (A) for a prospective client (active) places the
ServiceID that designates the desired service in the Request (REQ)
message that begins the connection establishment sequence. The
ServiceID allows the passive-side CM (B) to associate the request with
the appropriate server entity. Should the REQ be accepted, B's CM
returns the Queue Pair Number (QPN) (and End to End Context Number
(EECN) for RD service) in a Response (REP) MAD. Whether QPs and EECs
are pre-allocated or are allocated in response to a request is an
implementation consideration that is outside the scope of the IBA
specification.

In the Active/Active model, both entities begin as active (i.e., both
send REQ), but one ultimately takes the passive role for establishing
the connection. The selection of the passive entity is described in
section 12.10.4.
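
The Active/Passive exchange can be sketched as a toy message handler. This Python illustration is a deliberate simplification and not the MAD format: the `Message` fields, the `PassiveCM` class and the `connect` helper are hypothetical, the RTU is assumed to carry the QPN so the passive side can match it, and answering an unknown ServiceID with REJ is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Message:
    kind: str            # "REQ", "REP", "RTU" or "REJ"
    service_id: int = 0
    qpn: int = 0


class PassiveCM:
    """Passive-side CM (B) listening on behalf of registered servers."""

    def __init__(self, services):
        self.services = services   # ServiceID -> QPN of the server's QP
        self.pending = set()       # QPNs for which REP was sent
        self.established = set()   # QPNs for which RTU was received

    def handle(self, msg):
        if msg.kind == "REQ":
            qpn = self.services.get(msg.service_id)
            if qpn is None:
                return Message("REJ", service_id=msg.service_id)
            self.pending.add(qpn)
            return Message("REP", service_id=msg.service_id, qpn=qpn)
        if msg.kind == "RTU" and msg.qpn in self.pending:
            self.pending.discard(msg.qpn)
            self.established.add(msg.qpn)
        return None


def connect(passive_cm, service_id):
    """Active side (A): send REQ, and on REP complete with RTU."""
    reply = passive_cm.handle(Message("REQ", service_id=service_id))
    if reply.kind != "REP":
        return None
    passive_cm.handle(Message("RTU", qpn=reply.qpn))
    return reply.qpn
```

The ServiceID is the only thing the client needs to know in advance; the QPN is learned from the REP.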



12.3 Automatic Path Migration

The connection establishment messages specify the information
necessary to support an (optional) alternate pair of endpoints to
support Automatic Path Migration (APM). APM is described in section
17.2.8.1 and the support mechanisms are described in section 10.4.
Channel Adapters that do not support APM may ignore the Alternate
address information.

12.4 Release



Connections are released through the exchange of Disconnect Request 
(DREQ) and Disconnect Reply (DREP) MADs. Communicating entities 
will likely wish to effect an orderly shutdown of their protocol before initi- 
ating the Disconnect sequence. After a connection is released, the CM 
shall cause each involved QP/EEC to be placed into the TimeWait state 
as defined in section 12.9.8.4 . 

CMs shall maintain enough connection state information to detect an at- 
tempt to initiate a connection on a remote QP/EEC that has not been re- 
leased from a connection with a local QP/EEC, or that is in the TimeWait 
state. Such an event could occur if the remote CM had dropped the con- 
nection and sent DREQ, but the DREQ was not received by the local CM. 
If the local CM receives a REQ that includes a QPN (or EECN if 
REQ:RDC Exists is not set), that it believes to be connected to a local 
QP/EEC, the local CM shall act as defined in section 12.9.8.3 . 
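
The bookkeeping needed to detect such a stale REQ can be sketched as a small table keyed by remote QPN. This is a Python illustration under stated assumptions: `ConnectionTable` and its method names are hypothetical, the state is tracked per remote QPN only, and the action taken on detection (defined in section 12.9.8.3) is not modelled.

```python
class ConnectionTable:
    """Toy record of which remote QPs the local CM believes are in use."""

    def __init__(self):
        self._remote = {}  # remote QPN -> "Connected" or "TimeWait"

    def connected(self, remote_qpn):
        self._remote[remote_qpn] = "Connected"

    def released(self, remote_qpn):
        # After the DREQ/DREP exchange the entry lingers in TimeWait.
        if remote_qpn in self._remote:
            self._remote[remote_qpn] = "TimeWait"

    def timewait_expired(self, remote_qpn):
        if self._remote.get(remote_qpn) == "TimeWait":
            del self._remote[remote_qpn]

    def classify_req(self, remote_qpn):
        """Classify an incoming REQ that names remote_qpn."""
        state = self._remote.get(remote_qpn)
        if state is None:
            return "new"
        return "stale: remote QP still " + state
```

A REQ is treated as new only once the earlier connection has been released and its TimeWait period has ended.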



12.5 Service Types

12.5.1 Supported Protocols

The sections that follow contain message descriptions and state
diagrams specifying how those messages are exchanged. The messages are
used for the following purposes:

• To support connection establishment for RC and UC service types.

• To support end to end context establishment for RD service.

Figure 119 illustrates the following relationships.

12.5.2 Connected Services

A channel is established for Reliable Connected (RC) and Unreliable
Connected (UC) service types by reaching agreement between the end CMs.

12.5.3 Unreliable Datagram Service

Unreliable Datagram (UD) service allows a message to be sent to any
destination, although there is no guarantee that the destination will
receive or accept it. The ServiceID resolution facility (Section 12.11)
may be used to determine the appropriate target QP.



12.5.4 Reliable Datagram

Reliable Datagram (RD) service allows multiple Queue Pairs to
communicate over a single RD channel (defined by a pair of EE
contexts). One QP on each end is specified when an RD channel is
established. A pair of applications using these QPs that wish to use
additional QPs over that RDC do not need to use CM to associate those
QPs. Application-specific messages could be sent over the original QPs
to notify the other side of the QPNs of the new QPs.

Unless otherwise specified, an RD communication request implies the
creation of a new RDC. Setting the RDC Exists field in the REQ message
allows the reuse of the specified RDC. (See section 12.6.5.)

12.6 Communication Management Messages

The following sections describe the set of messages used to support the
communication establishment scenarios supported by the IBA:

a) Active client to passive server

b) Active client to active client

c) Active client to passive server (with third-party redirector)

12.6.1 Required Messages

All IBA hosts and all IBA targets that support RC, UC, or RD service
types shall support the following messages:

• Request for Communication (REQ) (Section 12.6.5)

• Message Receipt Acknowledgement (MRA) (Section 12.6.6). All IBA hosts
  and targets are required to be able to receive and act upon an MRA,
  but the ability to send an MRA is optional.

• Reject (REJ) (Section 12.6.7)

• Reply to Request for Communication (REP) (Section 12.6.8)

• Ready to Use (RTU) (Section 12.6.9)

• Request for Communication Release (DREQ) (Section 12.6.10)

• Reply to Request for Communication Release (DREP) (Section 12.6.11)

5 

C12-1: A CA that supports Reliable Connected, Unreliable Connected, or
Reliable Datagram channels shall support their establishment using the
CM protocol.

C12-2: For the states and messages it supports, a CM shall adhere to the
CM protocol as defined in sections 12.9.7 and 12.9.8.

C12-3: CM message contents shall conform to the field descriptions in
section 12.7.

C12-4: A CM shall support sending the REJ message in accordance with
section 12.6.7.

C12-5: A CM shall, upon receipt of an MRA message, behave in
accordance with section 12.9.8.5.

o12-1: If a CM sends the REQ message, it shall do so in accordance with
section 12.6.5.

o12-2: If a CM sends the MRA message, it shall do so in accordance with
section 12.6.6.

o12-3: If a CM sends the REP message, it shall do so in accordance with
section 12.6.8.

o12-4: If a CM sends the RTU message, it shall do so in accordance with
section 12.6.9.

o12-5: If a CM sends the DREQ message, it shall do so in accordance
with section 12.6.10.

o12-6: If a CM sends the DREP message, it shall do so in accordance
with section 12.6.11.

o12-7: If a CM initiates connection requests (active role), it shall support
sending the REQ, RTU, DREQ, and DREP messages, and responding to
the REP, DREQ, and DREP messages.

o12-8: If a CM accepts connection requests (passive role), it shall support
responding to the REQ, RTU, and DREQ messages, and sending the
REP and DREP messages.

o12-9: If a CM sends the DREQ message, it shall be able to handle the
DREP message.
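As a rough illustration of o12-7 and o12-8, the per-role obligations can be written down as sets and checked against what an implementation claims to support. The names ROLE_REQUIREMENTS and missing_support are hypothetical and are not part of the specification.

```python
# Message obligations per CM role, transcribed from o12-7 and o12-8.
# ROLE_REQUIREMENTS and missing_support are illustrative names only.
ROLE_REQUIREMENTS = {
    "active": {
        "send": {"REQ", "RTU", "DREQ", "DREP"},
        "respond_to": {"REP", "DREQ", "DREP"},
    },
    "passive": {
        "send": {"REP", "DREP"},
        "respond_to": {"REQ", "RTU", "DREQ"},
    },
}


def missing_support(role, can_send, can_handle):
    """Return (unsent, unhandled): the messages a CM in `role` still lacks."""
    required = ROLE_REQUIREMENTS[role]
    return (
        required["send"] - set(can_send),
        required["respond_to"] - set(can_handle),
    )


# A passive CM that can send REP but not DREP is missing one obligation.
print(missing_support("passive", {"REP"}, {"REQ", "RTU", "DREQ"}))
```

An empty pair of sets means the claimed message support covers everything the role requires under these two statements.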



12.6.2 Conditionally Required Messages

Support for these messages is required if non-management services are
provided on the Channel Adapter at other than fixed QPNs. Management
services include those provided through Subnet Management Packets
(see 14.2 Subnet Management Class) or through General Management
Packets (see Chapter 16: General Services).

• Service ID Resolution Request (SIDR_REQ) (Section 12.11.1)

• Service ID Resolution Response (SIDR_REP) (Section 12.11.2)

o12-10: If a CM sends the SIDR_REQ message, it must do so in
accordance with section 12.11.1.

o12-11: If a CM sends the SIDR_REP message, it must do so in
accordance with section 12.11.2.

o12-12: If a CA provides services (other than Subnet Management and
General Services) using the UD service type at other than fixed QPNs, its
CM must support receiving, processing and replying to the SIDR_REQ
message as specified in section 12.11.

12.6.3 Optional Messages

Support for these messages is optional:

• Load Alternate Path (LAP) (Section 12.8.1)

• Alternate Path Response (APR) (Section 12.8.2)

o12-13: If a CM accepts REQ messages and agrees to perform Automatic
Path Migration, it shall support receiving, processing and replying
to the LAP message as specified in section 12.8.

o12-14: If a CM sends REQ messages with Alternate Port/Path
information, it shall support sending the LAP message as specified in
section 12.8.

12.6.4 Message Usage

Connected Transport Service Types require state information to be
established, maintained, and released at both ends of the connection.
Consumers can use the messages described in this section for that purpose.

By definition, unreliable datagram communications do not require any
connection state to be established, maintained, or released. However,
communication services are provided to allow local and remote QPs to be
associated based on a specific Service ID. (See section 12.11.)

Reliable datagram communication requires Reliable Datagram Channels
to be created, maintained, and released between CAs.



The Communication Management information contained in each Man- 
agement Datagram message is described below. The MAD header format 
is defined in 16.7.1 MAD Format on page 786 . 

The messages defined below are used for both establishing connections 
and end to end context establishment. The message definitions are the 
union of the fields required for both of these purposes, and therefore there 
are some fields in the messages which are useful for connection estab- 
lishment but not for end to end context establishment, and vice versa. 
This is done to decrease the total number of message types in the pro- 
tocol. For each field in a message, whether the field is intended to support 
connection establishment or end to end context establishment (or both) is 
noted. 

12.6.5 REQ - Request for Communication 

REQ is sent to initiate the communication establishment sequence. The 
initiator (REQ sender) provides the Port Address (GID and/or LID) and the 
Queue Pair Number that it will be using for its end of the channel. For Re- 
liable Datagram Channel establishment, the EE Context Number is in- 
cluded. 

The initiator is responsible for proposing the Port Addresses (Primary and 
optional Alternate) that the target (REQ recipient) is to use for the channel. 
Based on the path defined by those port addresses, the initiator provides 
timeout information and the Service Level to be used by the target for any 
messages that it initiates. The SL from initiator to target need not be the 
same as from target to initiator, but the same SL must be used for both the 
request packets and any associated ACK or NAK packets associated with 
that request. Path information is available from Subnet Administration 
(see section 15.2.5.16 PathRecord ). 

For service resolution and QP association over already existing Reliable 
Datagram Channels, REQ:RDC Exists must be set. 

Table 88 REQ Message Contents

Field                         Description           Used for  Byte[Bit]  Length,  Values
                                                    Purpose   Offset     bits

Local Communication ID        See section 12.7.1    C, EE     0          32
(reserved)                                                    4          32
ServiceID                     See section 12.7.3    C, EE     8          64
Local CA GUID                 See section 12.7.9    C, EE     16         64
Local CM Q_Key                See section 12.7.8    C, EE     24         32
Local Q_Key                   See section 12.7.13   EE        28         32
Local QPN                     See section 12.7.12   C, EE     32         24
Offered Responder Resources   See section 12.7.29   C, EE     35         8
Local EECN                    See section 12.7.14   EE        36         24
Offered Initiator Depth       See section 12.7.30   C, EE     39         8
Remote EECN                   See section 12.7.15   EE        40         24
Remote CM Response Timeout    See section 12.7.4    C, EE     43         5
Transport Service Type        See section 12.7.6    C, EE     43[5]      2
End-to-End Flow Control       See section 12.7.26   C, EE     43[7]      1
Starting PSN                  See section 12.7.31   C, EE     44         24
Local CM Response Timeout     See section 12.7.5    C, EE     47         5
Retry Count                   See section 12.7.38   C, EE     47[5]      3
Partition Key                 See section 12.7.24   C, EE     48         16
Path Packet Payload MTU       See section 12.7.28   C, EE     50         4
RDC Exists                    Whether RDC already   EE        50[4]      1        1 if RDC exists,
                              exists.                                             0 if RDC does not
RNR Retry Count               See section 12.7.39   C, EE     50[5]      3
Max CM Retries                See section 12.7.27   C, EE     51         4
(reserved)                                                    51[4]      4
Primary Local Port LID        See section 12.7.11   C, EE     52         16
Primary Remote Port LID       See section 12.7.21   C, EE     54         16
Primary Local Port GID        See section 12.7.10   C, EE     56         128
Primary Remote Port GID       See section 12.7.20   C, EE     72         128
Primary Flow Label            See section 12.7.18   C, EE     88         20
(reserved)                                                    90[4]      4
Primary Packet Rate (IPD)     See section 12.7.25   C, EE     91         8
Primary Traffic Class         See section 12.7.17   C, EE     92         8
Primary Hop Limit             See section 12.7.19   C, EE     93         8
Primary SL                    See section 12.7.16   C, EE     94         4
Primary Subnet Local          See section 12.7.7    C, EE     94[4]      1
(reserved)                                                    94[5]      3
Primary Local ACK Timeout     See section 12.7.34   C, EE     95         5
(reserved)                                                    95[5]      3
Alternate Local Port LID      See section 12.7.11   C, EE     96         16
Alternate Remote Port LID     See section 12.7.23   C, EE     98         16
Alternate Local Port GID      See section 12.7.10   C, EE     100        128
Alternate Remote Port GID     See section 12.7.22   C, EE     116        128
Alternate Flow Label          See section 12.7.18   C, EE     132        20
(reserved)                                                    134[4]     4
Alternate Traffic Class       See section 12.7.17   C, EE     135        8
Alternate Hop Limit           See section 12.7.19   C, EE     136        8
Alternate Packet Rate (IPD)   See section 12.7.25   C, EE     137        8
Alternate SL                  See section 12.7.16   C, EE     138        4
Alternate Subnet Local        See section 12.7.7    C, EE     138[4]     1
(reserved)                                                    138[5]     3
Alternate Local ACK Timeout   See section 12.7.34   C, EE     139        5
(reserved)                                                    139[5]     3
PrivateData                   See section 12.7.35   C, EE     140        736
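The byte and bit offsets in Table 88 can be read mechanically. The following is a minimal sketch of a field extractor for a handful of REQ fields. It assumes a big-endian layout in which bit offset 0 is the most significant bit of the byte, which this section does not state explicitly. The names get_bits, REQ_FIELDS and parse_req are hypothetical, and only a subset of the table is transcribed.

```python
def get_bits(buf, byte_off, bit_off, length):
    """Extract `length` bits starting `bit_off` bits into byte `byte_off`.

    Assumption: big-endian layout, bit offset 0 is the most significant bit.
    """
    start = byte_off * 8 + bit_off
    end = start + length
    first, last = start // 8, (end + 7) // 8
    chunk = int.from_bytes(buf[first:last], "big")
    shift = last * 8 - end  # unused low-order bits in the last byte
    return (chunk >> shift) & ((1 << length) - 1)


# (byte offset, bit offset, length in bits), transcribed from Table 88.
REQ_FIELDS = {
    "local_communication_id": (0, 0, 32),
    "service_id": (8, 0, 64),
    "local_qpn": (32, 0, 24),
    "remote_cm_response_timeout": (43, 0, 5),
    "transport_service_type": (43, 5, 2),
    "end_to_end_flow_control": (43, 7, 1),
    "starting_psn": (44, 0, 24),
    "partition_key": (48, 0, 16),
    "rdc_exists": (50, 4, 1),
    "primary_local_port_lid": (52, 0, 16),
}


def parse_req(payload):
    """Decode the transcribed subset of REQ fields from a raw payload."""
    return {name: get_bits(payload, *spec) for name, spec in REQ_FIELDS.items()}


# Build a synthetic payload (140 header bytes + 92 bytes of PrivateData).
payload = bytearray(232)
payload[32:35] = (0x012345).to_bytes(3, "big")   # Local QPN
payload[43] = (20 << 3) | (2 << 1) | 1           # timeout, service type, flow control
fields = parse_req(bytes(payload))
print(fields["local_qpn"], fields["remote_cm_response_timeout"])
```

Keeping the layout in a table of tuples rather than hand-written shifts makes it easier to compare the code line by line against Table 88.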



12.6.6 MRA - Message Receipt Acknowledgment 



MRA is sent in response to a REQ or REP message when the recipient of 
the message anticipates that it will not be able to respond within the time 
specified by REQ: Remote CM Response Timeout . MRA is sent to pre- 
vent the other party in the communication establishment protocol from ei- 
ther unnecessarily timing out the communication establishment attempt or 
flooding the link with unnecessary retries. 
Table 89 MRA Message Contents

Field                     Description                   Used for  Byte[Bit]  Length,  Values
                                                        Purpose   Offset     bits

Local Communication ID    See section 12.7.1            C, EE     0          32
Remote Communication ID   See section 12.7.2            C, EE     4          32
Message MRAed             The message being MRAed.      C, EE     8          2        0x0 - REQ
                                                                                      0x1 - REP
                                                                                      0x2 - LAP
(reserved)                                                        8[2]       6
Service Timeout           See section 12.7.32           C, EE     9          5
(reserved)                                                        9[5]       3
PrivateData               See section 12.7.35           C, EE     10         1776
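As a sketch of how the Table 89 layout might be serialized, the function below packs an MRA payload. It assumes big-endian byte order with bit offset 0 at the most significant bit, and it zero-pads PrivateData to its full 1776 bits. Neither choice is stated in this section, and build_mra is a hypothetical name.

```python
import struct

# "Message MRAed" codes from Table 89.
MESSAGE_MRAED = {"REQ": 0x0, "REP": 0x1, "LAP": 0x2}


def build_mra(local_id, remote_id, message, service_timeout, private_data=b""):
    """Pack an MRA payload following the offsets in Table 89 (sketch only)."""
    if not 0 <= service_timeout < 32:
        raise ValueError("Service Timeout is a 5-bit field")
    if len(private_data) > 222:
        raise ValueError("PrivateData holds at most 1776 bits (222 bytes)")
    header = struct.pack(
        ">IIBB",
        local_id,                      # byte 0, 32 bits
        remote_id,                     # byte 4, 32 bits
        MESSAGE_MRAED[message] << 6,   # byte 8: 2-bit code, then 6 reserved bits
        service_timeout << 3,          # byte 9: 5-bit timeout, then 3 reserved bits
    )
    # Assumption: unused PrivateData bytes are sent as zero.
    return header + private_data.ljust(222, b"\x00")


mra = build_mra(local_id=1, remote_id=2, message="REP", service_timeout=17)
print(len(mra), hex(mra[8]), hex(mra[9]))
```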




12.6.7 REJ - Reject 














REJ indicates that the sender will not continue through the communica- 
tion establishment sequence, and the reason why it will not. 




Table 90 REJ Message Contents

Field                     Description                   Used for  Byte[Bit]  Length,  Values
                                                        Purpose   Offset     bits

Local Communication ID    See section 12.7.1            C, EE     0          32       0 if REJecting a REQ
                                                                                      and no MRA was sent
Remote Communication ID   See section 12.7.2            C, EE     4          32
Message REJected          The message being REJected.   C, EE     8          2        0x0 - REQ
                                                                                      0x1 - REP
                                                                                      0x2 - Unknown/No
                                                                                      message
(reserved)                                                        8[2]       6
Reject Info Length        If non-zero, the length in    C, EE     9          4
                          bytes of valid Additional
                          Reject Information
(reserved)                                                        9[4]       4
Reason                    Error code indicating the     C, EE     10         16
                          reason for the sender's
                          termination of the
                          communication establishment
                          process.
Additional Reject                                       C, EE     12         128
Information (ARI)
Private Data              See section 12.7.35           C, EE     28         1632



12.6.7.1 Example REJ message 



The contents of the fields of a REJ that rejects a REQ because of an
unacceptable primary port LID, and suggests that a primary port LID of 200
be used, are shown in the table below.



Table 91 Example REJ Message

Field                                 Contents

Local Communication ID                0
Remote Communication ID               0
Message REJected                      0
Reason                                15
Reject Info Length                    2
Additional Reject Information (ARI)   200
PrivateData                           empty
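The example values in Table 91 can be placed into the Table 90 layout as follows. This is a sketch under several assumptions not stated in this section: big-endian byte order, bit offset 0 at the most significant bit, valid ARI bytes placed at the start of the 128-bit ARI field, and zero padding of unused bytes. The name build_rej is hypothetical.

```python
import struct


def build_rej(local_id, remote_id, message_rejected, reason, ari=b"", private_data=b""):
    """Pack a REJ payload following the offsets in Table 90 (sketch only)."""
    if len(ari) > 15:
        raise ValueError("Reject Info Length is a 4-bit field")
    if len(private_data) > 204:
        raise ValueError("Private Data holds at most 1632 bits (204 bytes)")
    header = struct.pack(
        ">IIBBH",
        local_id,                # byte 0, 32 bits
        remote_id,               # byte 4, 32 bits
        message_rejected << 6,   # byte 8: 2-bit code, then 6 reserved bits
        len(ari) << 4,           # byte 9: 4-bit length, then 4 reserved bits
        reason,                  # byte 10, 16 bits
    )
    # Assumption: valid ARI bytes come first and the remainder is zero.
    return header + ari.ljust(16, b"\x00") + private_data.ljust(204, b"\x00")


# Table 91: both Communication IDs 0, Message REJected 0 (REQ), Reason 15,
# Reject Info Length 2, ARI carrying the suggested LID 200 in two bytes.
rej = build_rej(0, 0, 0x0, 15, ari=(200).to_bytes(2, "big"))
print(len(rej), rej[9] >> 4, int.from_bytes(rej[12:14], "big"))
```

The Reject Info Length of 2 in Table 91 matches the two bytes needed to carry a 16-bit LID.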



12.6.7.2 Rejection Reason 



Code  Reason / Description / Meaning of Additional Reject Information Field

1     No QP available
      Description: The REQ message required the recipient to allocate a QP,
      and none were available.

2     No EEC available
      Description: The REQ message required the recipient to allocate an EE
      context, and none were available.

3     No resources available
      Description: The REQ message required the recipient to allocate
      resources other than QPs or EE contexts, and none were available.

4     Timeout
      Description: The CM protocol timed out waiting for a message.

5     Unsupported request
      Description: Receiving CM does not support this request.

6     Invalid Communication ID
      Description: The recipient received a CM message in which the Local
      Communication ID, Remote Communication ID, or both, were invalid.

7     Invalid Communication Instance
      Description: The Local Communication ID, Remote Communication ID,
      QPN/EECN tuple does not refer to any valid communication instance.

8     Invalid Service ID
      Description: The recipient of the REQ message does not recognize or
      does not support the service associated with the specified ServiceID.

9     Invalid Transport Service Type
      Description: The recipient of the REQ message did not recognize the
      requested Transport Service Type.

10    Stale connection
      Description: The recipient of the REQ determined that it already had a
      connection with the "Local QPN" or "Local EECN" specified in the REQ.
      Upon receiving a REJ with this reason, the REJ recipient shall cause
      the QP or EE context to be placed into the TimeWait state as described
      in section 12.9.8.4.

11    RDC does not exist
      Description: The Reliable Datagram Channel described in the REQ (Local
      EECN/Remote EECN) does not exist.

12    Primary Remote Port GID rejected
      Description: The recipient of the REQ message could not (or would not)
      accept the Primary Remote Port GID.
      Additional Reject Information: GID of acceptable port.

13    Primary Remote Port LID rejected
      Description: The recipient of the REQ message could not (or would not)
      accept the Primary Remote Port LID.
      Additional Reject Information: LID of acceptable port.

14    Invalid Primary SL
      Description: The recipient of the REQ message does not support the
      requested Primary SL.
      Additional Reject Information: Acceptable SL.

15    Invalid Primary Traffic Class
      Description: The recipient of the REQ message does not support the
      requested Primary Traffic Class.
      Additional Reject Information: Acceptable Traffic Class.

16    Invalid Primary Hop Limit
      Description: The recipient of the REQ message could not (or would not)
      accept the Primary Hop Limit.
      Additional Reject Information: Acceptable Hop Limit.

17    Invalid Primary Packet Rate
      Description: The recipient of the REQ message could not adjust its
      transmitter to send as slowly as would be required to comply with the
      requested Primary Packet Rate.
      Additional Reject Information: Minimum acceptable Packet Rate.

18    Alternate Remote Port GID rejected
      Description: The recipient of the REQ message could not (or would not)
      accept the Alternate Remote Port GID.
      Additional Reject Information: GID of acceptable port.

19    Alternate Remote Port LID rejected
      Description: The recipient of the REQ message could not (or would not)
      accept the Alternate Remote Port LID.
      Additional Reject Information: LID of acceptable port.

20    Invalid Alternate SL
      Description: The recipient of the REQ message does not support the
      requested Alternate SL.
      Additional Reject Information: Acceptable SL.

21    Invalid Alternate Traffic Class
      Description: The recipient of the REQ message does not support the
      requested Alternate Traffic Class.
      Additional Reject Information: Acceptable Traffic Class.

22    Invalid Alternate Hop Limit
      Description: The recipient of the REQ message could not (or would not)
      accept the Alternate Hop Limit.
      Additional Reject Information: Acceptable Hop Limit.

23    Invalid Alternate Packet Rate
      Description: The recipient of the REQ message could not adjust its
      transmitter to send as slowly as would be required to comply with the
      requested Alternate Packet Rate.
      Additional Reject Information: Minimum acceptable Packet Rate.

24    Port and CM Redirection
      Description: The recipient of the REQ message supports the requested
      Service ID, but at the port specified by the ARI. Further CM messages
      should be sent to that port as well.
      Additional Reject Information: GID of port to use for further CM
      messages and in new REQ.

25    Port Redirection
      Description: The recipient of the REQ message supports the requested
      Service ID, but at the port specified by the ARI. Further CM messages
      shall be sent to the port to which the original REQ was sent.
      Additional Reject Information: GID of port to propose in new REQ.

26    Invalid Path MTU
      Description: The recipient of the REQ message cannot support the
      maximum packet payload size specified.
      Additional Reject Information: Maximum acceptable maximum packet
      payload size.

27    Insufficient Responder Resources
      Description: The value of Responder Resources (for RDMA Read/Atomics)
      in the REP message was insufficient.

28    Consumer Reject
      Description: The consumer decided to reject the communication or EE
      context setup establishment attempt for reasons other than those
      listed above. (Typically this happens based upon information being
      conveyed in the PrivateData field of a message.)
      Additional Reject Information: Defined by the consumer.

29    RNR Retry Count Reject
      Description: The recipient of the message rejects the RNR NAK Retry
      count value.

12.6.8 REP - Reply to Request for Communication 



REP is returned in response to REQ, indicating that the respondent
accepts the ServiceID, proposed primary port, and any parameters specified
in the PrivateData area of the REQ.

Table 92 REP Message Contents

Field                         Description           Used for  Byte[Bit]  Length,  Values
                                                    Purpose   Offset     bits

Local Communication ID        See section 12.7.1    C, EE     0          32
Remote Communication ID       See section 12.7.2    C, EE     4          32       Value present in REQ
Local Q_Key                   See section 12.7.13   EE        8          32
Local QPN                     See section 12.7.12   C, EE     12         24
(reserved)                                                    15         8
Local EE Context Number       See section 12.7.14   EE        16         24
(reserved)                                                    19         8
Starting PSN                  See section 12.7.31   C, EE     20         24
(reserved)                                                    23         8
Responder Resources           See section 12.7.29   C, EE     24         8
Initiator Depth               See section 12.7.30   C, EE     25         8
Target ACK Delay              See section 12.7.33   C, EE     26         5
Failover Accepted             See section 12.7.36   C, EE     26[5]      2        0: Failover accepted
                                                                                  1: Failover port rejected,
                                                                                  failover not supported
                                                                                  2: Failover port rejected for
                                                                                  reason other than failover
                                                                                  not supported
End-To-End Flow Control       See section 12.7.26   C, EE     26[7]      1
RNR Retry Count               See section 12.7.39   C, EE     27         3
(reserved)                                                    27[3]      5
PrivateData                   See section 12.7.35   C, EE     28         1632



12.6.9 RTU - Ready To Use

RTU indicates that the connection is established, and that the recipient
may begin transmitting.

Table 93 RTU Message Contents

Field                         Description           Used for  Byte[Bit]  Length,
                                                    Purpose   Offset     bits

Local Communication ID        See section 12.7.1    C, EE     0          32
Remote Communication ID       See section 12.7.2    C, EE     4          32
PrivateData                   See section 12.7.35   C, EE     8          1792
12.6.10 DREQ - Request for communication Release (Disconnection REQuest)

DREQ is sent to initiate the connection release sequence.

Table 94 DREQ Message Contents

Field                         Description           Used for  Byte[Bit]  Length,
                                                    Purpose   Offset     bits

Local Communication ID        See section 12.7.1    C, EE     0          32
Remote Communication ID       See section 12.7.2    C, EE     4          32
Remote QPN/EECN               See section 12.7.37   C, EE     8          24
(reserved)                                                    11         8
PrivateData                   See section 12.7.35   C, EE     12         1760

The values for Local and Remote Communication ID are those that were
used to create the channel.



12.6.11 DREP - Reply to Request for communication Release

DREP is sent in response to DREQ, and signifies that the sender has
received the DREQ.

Table 95 DREP Message Contents

Field                         Description           Used for  Byte[Bit]  Length,
                                                    Purpose   Offset     bits

Local Communication ID        See section 12.7.1    C, EE     0          32
Remote Communication ID       See section 12.7.2    C, EE     4          32
PrivateData                   See section 12.7.35   C, EE     8          1792
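A quick arithmetic check on Tables 88 through 95: PrivateData is the last field in every message, so its byte offset plus its length gives the total payload size. The sketch below transcribes those two numbers per message. The observation that every message comes to the same total follows from the tables and is not a statement made by this section, and the names PRIVATE_DATA and message_length are hypothetical.

```python
# (byte offset, length in bits) of PrivateData, the final field of each
# message, transcribed from Tables 88, 89, 90, 92, 93, 94 and 95.
PRIVATE_DATA = {
    "REQ": (140, 736),
    "MRA": (10, 1776),
    "REJ": (28, 1632),
    "REP": (28, 1632),
    "RTU": (8, 1792),
    "DREQ": (12, 1760),
    "DREP": (8, 1792),
}


def message_length(name):
    """Total payload length in bytes implied by the PrivateData row."""
    offset, bits = PRIVATE_DATA[name]
    return offset + bits // 8


for name in PRIVATE_DATA:
    print(f"{name:5s} {message_length(name)} bytes")
```

Every entry works out to 232 bytes, which is a useful sanity check when transcribing offsets from the tables: a row that breaks the total usually signals a copying error.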



12.7 Message Field Details 



The following table summarizes each of the message fields, and indicates
where the consumer can find the contents necessary to populate the field.
Table 96 Message Field Origins

Field
    Populated From

Local Communication ID
    The consumer sending the REQ message chooses this value. See section
    12.7.1.

Remote Communication ID
    The consumer replying to the REQ message chooses this value. See
    section 12.7.2.

Service ID
    Assuming that the consumer uses the InfiniBand™ service naming
    facility, this comes from the ServiceRecord, as defined in section
    15.2.5.14 ServiceRecord.

Remote CM Response Timeout
    The consumer should set this field to be large enough to allow enough
    time under normal circumstances for the recipient to be able to process
    the incoming message and have the response message traverse the path
    between source and destination. The service time at the recipient
    depends upon the service being requested, but the maximum time it could
    take to successfully traverse the path can be found in the PathRecord
    as defined in section 15.2.5.16 PathRecord. (How the particular path to
    be used is selected is a policy decision that is left up to the
    consumer.)

Local CM Response Timeout
    This timeout period needs to allow for the path between the source and
    destination to be traversed twice, and also to allow for the REP
    message to be processed. The amount of time it takes to service the REP
    message may depend upon the service that was requested, but the maximum
    time it could take to successfully traverse the path can be found in
    the PathRecord as defined in section 15.2.5.16 PathRecord.

Transport Service Type
    The consumer sets this based upon the type of service it is requesting:
    Reliable Connected, Unreliable Connected, or Reliable Datagram.

Subnet Local
    This can be determined by comparing the PortInfo:SubnetPrefix fields
    associated with the Local Port GID and the Remote Port GID. The
    PortInfo record is defined in section 15.2.5.2 PortInfoRecord.

Local CM Q_Key
    The consumer can determine this for an HCA by querying the Queue Pair
    that it is using to send the message. The Query Queue Pair verb is
    defined in section 11.2.3.3 Query Queue Pair. (How this information is
    determined for a TCA is implementation-specific.)

Local CA GUID
    This information can be found in the NodeInfo:NodeGUID field, as
    defined in section 14.2.5.3 NodeInfo. (Which CA to use is a policy
    decision that is left up to the consumer.)

Local Port GID
    This information can be found in the GidInfo record, as defined in
    section 15.2.5.19 GuidInfoRecord. (Which port on the CA to use and
    which of the available GIDs on the chosen port to use is a policy
    decision that is left up to the consumer.)

Local Port LID
    This information can be found in the PortInfo:LID field, as defined in
    section 14.2.5.6 PortInfo. (Which port on the CA to use and which of
    the available LIDs on the chosen port to use is a policy decision that
    is left up to the consumer.)

Local QPN
    The consumer can determine this for an HCA by querying the Queue Pair
    that it is offering up for connection establishment. The Query Queue
    Pair verb is defined in section 11.2.3.3 Query Queue Pair. (How this
    information is determined for a TCA is implementation-specific.)

Local Q_Key
    The consumer can determine this for an HCA by querying the Queue Pair
    that it is offering up for connection establishment. The Query Queue
    Pair verb is defined in section 11.2.3.3 Query Queue Pair. (How this
    information is determined for a TCA is implementation-specific.)

Local EECN
    The consumer can determine this for an HCA by querying the EE Context
    that it is offering up for communications establishment. The Query EE
    Context verb is defined in section 11.2.6.3 Query EE Context. (How this
    information is determined for a TCA is implementation-specific.)

Remote EECN
    The data originates on the remote end of an existing connection, and is
    returned to the local end in a REP message. It is determined by the
    remote end in the same manner as the Local EECN.

Service Level
    This information can be found in the PathRecord:SL field, as defined in
    section 15.2.5.16 PathRecord.

Traffic Class
    This information can be found in the PathRecord:TClass field, as
    defined in section 15.2.5.16 PathRecord.

Flow Label
    The purpose of this field is to identify a group of packets that must
    be delivered in order. See section 8.3 Global Route Header for a
    description of how this value is chosen.

Hop Limit
    This information can be found in the PathRecord:HopLimit field, as
    defined in section 15.2.5.16 PathRecord.

Primary Remote Port GID
    This information can be found in the GidInfo record associated with the
    remote port, as defined in section 15.2.5.19 GuidInfoRecord. The port
    that should be targeted based on the service being requested can be
    found in the ServiceRecord, as defined in section 15.2.5.14
    ServiceRecord.

Primary Remote Port LID
    This information can be found in the PortInfo:LID field associated with
    the remote port, as defined in section 14.2.5.6 PortInfo. The port that
    should be targeted based on the service being requested can be found in
    the ServiceRecord, as defined in section 15.2.5.14 ServiceRecord.

Alternate Remote Port GID
    This information can be found in the GidInfo record associated with the
    remote port, as defined in section 15.2.5.19 GuidInfoRecord. The port
    that should be targeted based on the service being requested can be
    found in the ServiceRecord, as defined in section 15.2.5.14
    ServiceRecord.

Alternate Remote Port LID
    This information can be found in the PortInfo:LID field associated with
    the remote port, as defined in section 14.2.5.6 PortInfo. The port that
    should be targeted based on the service being requested can be found in
    the ServiceRecord, as defined in section 15.2.5.14 ServiceRecord.

Partition Key
    This information can be found in the PathRecord:P_Key field, as defined
    in section 15.2.5.16 PathRecord.

Packet Rate
    This information can be found in the PathRecord:Rate field, as defined
    in section 15.2.5.16 PathRecord.

End-to-End Flow Control
    All HCAs are required to support End-to-End Flow Control, and so if the
    CA that the initiator is using is an HCA this field should be set to 1.
    Whether or not End-to-End Flow Control is supported by a TCA is an
    implementation option, and it is therefore outside the scope of the
    InfiniBand architecture to specify the origin of this field in a TCA.

Max CM Retries
    The value of this field is a policy decision that is outside the scope
    of Communication Management to define. The field is discussed in
    section 12.7.27.

Path Packet Payload MTU
    This information can be found in the PathRecord:Mtu field, as defined
    in section 15.2.5.16 PathRecord.

Responder Resources
    The consumer can determine the maximum supported value for a QP/EEC by
    querying the HCA that will be used for communication. The Query HCA
    verb is defined in section 11.2.1.2 Query HCA.

Initiator Depth
    The consumer can determine the maximum supported value for a QP/EEC by
    querying the HCA that will be used for communication. The Query HCA
    verb is defined in section 11.2.1.2 Query HCA.

Starting PSN
    The value of this field is a policy decision that is outside the scope
    of Communication Management to define. The field is discussed in
    section 12.7.31.

Service Timeout
    The consumer should set this field to be large enough to allow enough
    time for it to complete the processing of the incoming message and have
    the response message that it sends out traverse the path between source
    and destination. The incoming message processing time depends upon the
    service being requested and potentially other state, but the maximum
    time it could take to successfully traverse the path can be found in
    the PathRecord as defined in section 15.2.5.16 PathRecord.

Target ACK Delay
    The value of this field is a policy decision that is outside the scope
    of Communication Management to define. The field is discussed in
    section 12.7.33 Target ACK Delay.

Local ACK Timeout
    The value of this field is a policy decision that is outside the scope
    of Communication Management to define. The field is discussed in
    section 12.7.34 Local ACK Timeout.

PrivateData
    The contents of this field are outside the scope of what the
    InfiniBand™ specification defines; the usage (if any) of this field is
    specified by higher-level communications establishment protocols.

Failover Accepted
    Set as per the description in section 12.7.36 Failover Accepted.

Remote QPN/EECN
    This should be the same as the Local QPN/Local EECN returned in the REP
    message.

Retry Count
    The value of this field is a policy decision that is outside the scope
    of Communication Management to define. The field is discussed in
    section 12.7.38 Retry Count.

RNR Retry Count
    The value of this field is a policy decision that is outside the scope
    of Communication Management to define. The field is discussed in
    section 12.7.39 RNR Retry Count.
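The Subnet Local entry in Table 96 says the bit is derived by comparing subnet prefixes. A minimal sketch of that comparison follows. It assumes the subnet prefix is the upper 64 bits of the 128-bit GID, which this section does not spell out, and the function names and sample GID values are hypothetical.

```python
def subnet_prefix(gid):
    """Return the first 8 bytes of a 16-byte GID.

    Assumption: the subnet prefix occupies the upper 64 bits of the GID.
    """
    if len(gid) != 16:
        raise ValueError("a GID is 128 bits (16 bytes)")
    return gid[:8]


def subnet_local(local_gid, remote_gid):
    """Return 1 if both ports share a subnet prefix, otherwise 0."""
    return int(subnet_prefix(local_gid) == subnet_prefix(remote_gid))


# Made-up GIDs for illustration only.
local = bytes.fromhex("fe80000000000000" "0002c90300001111")
same_subnet = bytes.fromhex("fe80000000000000" "0002c90300002222")
other_subnet = bytes.fromhex("fec0000000000001" "0002c90300003333")
print(subnet_local(local, same_subnet), subnet_local(local, other_subnet))
```

The returned value matches the encoding of the Subnet Local field, where 1 means both ends are on the same subnet and 0 means they are not.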



12.7.1 Local Communication ID 



An Identifier that uniquely identifies this connection from the sender's 
point of view. The sender must use the same identifier for all phases of 
communication establishment and release. It must not reuse a Local 
Communication ID for the life of the connection, or while any messages 
related to the connection could still be In the fabric. (How long a message 
related to the connection could still be in the fabric is touched upon in sec- 
tion 12.9.8.4 .) The Communication ID allows the recipient to determine 



InfiniBand^'^ Trade Association 



Page 533 

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067 



12.7.2 Remote Communication ID 



InfiniBand^'^ Architecture Release 1 .0 Communication Management October 24, 2000 

Volume 1 - General Specifications FINAL 

whether the message is a duplicate of an old message, or represents a 1 
new connection request. 2 

3 
4 

An identifier that uniquely identifies this connection from the recipient's 5 
point of view. (As an example, for a REP message this would be the same g 
as the Local Communication ID that was received in the REQ message.) ^ 
The values in the Local and Remote Communication ID fields in the Com- 
munication Management MADs are exchanged between requests and re- ^ 
plies. 9 

10 

The pair of (Local Communication ID, Remote Communication ID) is used 1 1 
to reference connections during establishment, failover management, and ^ 2 
release. CM messages with invalid Communication IDs shall not be pro 
cessed, and shall be rejected as specified in section 12.6.7 . 



12.7.3 ServiceID

An identifier that specifies the service being requested. The ServiceID
field specifies the service number desired by the requestor. These
include, but are not limited to, the service numbers defined for typical TCP
services. The mappings between services and ServiceIDs are outside the
scope of Communication Management. See Volume 3, section 3.3.

12.7.4 Remote CM Response Timeout

The time, expressed as (4.096 µs * 2^(Remote CM Response Timeout)), within
which the CM message recipient shall transmit a response to the sender.
This value is unsigned. The recipient uses this information to determine
whether it should send an MRA. (See section 12.9.8.5.)

12.7.5 Local CM Response Timeout

The time, expressed as (4.096 µs * 2^(Local CM Response Timeout)), that the
remote CM shall wait for a response from the local CM to a CM message
sent by the remote CM. This value is unsigned. Note that whereas
Remote CM Response Timeout is the time between receipt of a message
and transmission of a response, Local CM Response Timeout includes
that "turn-around" time, as well as round trip packet flight time. (See
section 12.9.8.5.) The initiating CM is responsible for determining this value,
through Subnet Management or other means.

12.7.6 Transport Service Type

Specifies desired service type: Reliable Connected, Unreliable
Connected, or Reliable Datagram.
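The timeout fields in sections 12.7.4 and 12.7.5 are exponents rather than durations. A minimal sketch of the conversion in both directions follows; the function names are hypothetical, and the 5-bit range (0 to 31) is taken from the field length that Table 97 lists for Remote CM Response Timeout.

```python
BASE_US = 4.096       # base unit from the formula 4.096 us * 2^n
MAX_EXPONENT = 31     # largest value of an unsigned 5-bit field


def timeout_us(exponent: int) -> float:
    """Decode a timeout exponent into microseconds."""
    if not 0 <= exponent <= MAX_EXPONENT:
        raise ValueError("exponent must fit in 5 unsigned bits")
    return BASE_US * (1 << exponent)


def smallest_exponent(duration_us: float) -> int:
    """Return the smallest exponent whose decoded time covers duration_us."""
    for exponent in range(MAX_EXPONENT + 1):
        if timeout_us(exponent) >= duration_us:
            return exponent
    raise ValueError("duration too long to encode")
```

Rounding up in `smallest_exponent` is a choice made for this sketch; the text only defines how an exponent is interpreted, not how a CM should pick one.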
12.7.7 Subnet Local

0: Local and remote are on different subnets (LID fields not valid)

1: Local and remote are on same subnet (GID fields are still valid, though)

12.7.8 Local CM Q_Key

The Q_Key used by the sending CM. This value must be used in
messages sent to the CM.

12.7.9 Local CA GUID

The EUI-64 GUID of the sending Channel Adapter.

12.7.10 Local Port GID

The GID of the local CA port on which the channel is to be established. If
an alternate path is not to be specified, the Alternate Local Port GID field
shall be set to zero. If this field is non-zero, it shall contain a valid GID.

12.7.11 Local Port LID

The LID of the local CA port on which the channel is to be established. If
an alternate path is not to be specified, the Alternate Local Port LID field
shall be set to zero.

12.7.12 Local QPN

The QPN of the message sender's QP on which the channel is to be
established. One Reliable Datagram QP may be associated with multiple
EE contexts. A QPN must be specified when establishing an RD channel,
but use of this QPN is not limited to this RDC. Once a consumer
establishes a Reliable Datagram Channel, the consumer may use additional
QPs over the RDC without an additional connection establishment
exchange.

CM shall not be used to connect the Send Work Queue of a QP to the
Receive Work Queue of the same QP. (If so desired, the consumer can do
this using the Modify QP verb.) Attempting to do this may result in
unpredictable behavior when doing connection establishment between peers.

12.7.13 Local Q_Key

(RD Only) The Q_Key for the QP specified by Local QPN.

12.7.14 Local EECN

The EE Context Number for the message sender's end of the RD channel.

12.7.15 Remote EECN

The EE Context Number for the remote end of the existing Reliable
Datagram channel. 0 if REQ:RDC Exists is not set.
12.7.16 Service Level

The value to be placed in the Service Level field for packets sent by the
recipient. For more information on Service Levels, see section 7.6.5
Service Level on page 150.

12.7.17 Traffic Class

Defines Traffic Class for globally-routed packets.

12.7.18 Flow Label

Defines Flow Label for globally-routed packets.

12.7.19 Hop Limit

The maximum number of hops a packet can make between subnets
before being discarded.

12.7.20 Primary Remote Port GID

The GID of the remote node's CA port on which the local node wishes to
establish the channel. The remote node may send REJ to reject this port,
and may optionally suggest an acceptable port.

12.7.21 Primary Remote Port LID

The LID of the remote node's CA port on which the local node wishes to
establish the channel. The remote node may send REJ to reject this port,
and may optionally suggest an acceptable port. The sender is
responsible for ensuring that the LID and GID refer to the same port.

12.7.22 Alternate Remote Port GID

As in section 12.7.20. A CA that does not support automatic failover shall
set the REP 'Failover Accepted' field to one. If this field is zero, it shall
contain a valid GID.

12.7.23 Alternate Remote Port LID

As in section 12.7.21. A CA that does not support automatic failover shall
set the REP 'Failover Accepted' field to one.

12.7.24 Partition Key

The Partition Key to be used for the channel being established.

12.7.25 Packet Rate

The maximum rate at which the remote may transmit over this channel,
specified as described in section 9.11 Static Rate Control.
12.7.26 End-to-End Flow Control

Signifies whether the local CA actually implements End-to-End Flow
Control (1), or instead always advertises 'infinite credits' (0). See section
9.7.7.2 End-to-End (Message Level) Flow Control for more detail.

12.7.27 Max CM Retries

Maximum number of times that either party can re-send a REQ, REP, or
DREQ message. After re-sending for the maximum number of times
without a response, the sending party should then terminate the protocol
by sending a REJ message indicating that it timed out.

12.7.28 Path Packet Payload MTU

Specifies the maximum packet payload size, in bytes, for the channel
being established. One of 256, 512, 1024, 2048, 4096. This value applies
to both the primary and alternate paths.

12.7.29 Responder Resources

The maximum number of outstanding RDMA Read/Atomic operations the
sender will support from the remote QP/EEC. This value may be zero.
The maximum number that the HCA can support for a QP/EEC can be
determined using the Query HCA verb. See section 11.2.1.2 Query HCA.
Upon receiving the REP message, the requestor must decide whether the
offered resources are sufficient for the intended use. If not, it may send
the REJ message to discontinue the connection establishment.

12.7.30 Initiator Depth

The maximum number of outstanding RDMA Read/Atomic operations the
sender will have to the remote QP/EEC. The Initiator Depth chosen by
one side of the channel shall not exceed the Responder Resources
offered by the other side. The maximum number that the HCA can support
for a QP/EEC can be determined using the Query HCA verb. See section
11.2.1.2 Query HCA.

12.7.31 Starting PSN

The transport Packet Sequence Number at which the remote node
(relative to the sender of the REQ or REP message) shall begin transmitting
over the newly established channel. This value should be chosen to
minimize the chance that a packet from a previous connection could fall within
the valid PSN window.

12.7.32 Service Timeout

Present in the MRA. The maximum time required for the sender to send
a REP, RTU, APR, or REJ (as appropriate). This value is expressed as
(4.096 µs * 2^(Service Timeout)) from the time the MRA is posted to the Send
queue. The recipient of the MRA shall wait the specified time, plus a
packet lifetime, after receiving this message before timing out. (See
section 12.9.8.5.) This value is unsigned.



12.7.33 Target ACK Delay

(4.096 µs * 2^(Target ACK Delay)) represents the maximum time between the
target CA's reception of a message and the transmission of the associated
ACK or NAK. This is information furnished by the target to the recipient.
It provides the recipient with information about the maximum message
processing latency of the target, which is one component of the overall
time it takes to get an ACK or NAK after having sent a request packet.
(The other component is the network propagation delay, which depends
upon the configuration of the switches and routers between the two
endpoints as well as the congestion in the network.) The recipient of the
message containing the Target ACK Delay should use this value along with
the recipient's best estimate of the network propagation delay to
determine how long to wait before timing out a packet transmission to the
target. This value is unsigned.

12.7.34 Local ACK Timeout

Value representing the transport (ACK) timeout for use by the remote,
expressed as (4.096 µs * 2^(Local ACK Timeout)). Calculated by REQ sender,
based on (2 * Expected Time to traverse the path between source and
target CA + Local CA's ACK delay). Although the remote CA is not
required to use this value for its ACK timeout, it is strongly encouraged to do
so.

If too small a value is chosen for the Local ACK Timeout, the number of
packet transmission timeouts reported by the remote CA may increase,
which may increase the amount of work that is required in the CA to
successfully send a packet. If too large a value is chosen, the amount of time
that it takes to notice that a packet has not been successfully transmitted
(e.g. due to a CRC error on the wire) will be increased, which may
increase the amount of time it takes to recover from or report such errors.

The maximum amount of time that it could take any packet to successfully
traverse the path between source and target is contained in the
PathRecord, as defined in section 15.2.5.16 PathRecord. Local ACK Timeout
is unsigned.
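The Local ACK Timeout calculation described in section 12.7.34 can be illustrated with a short sketch. The function names are hypothetical, the inputs are assumed to be already available in microseconds (path time) and as an exponent (ACK delay), and rounding up to the next exponent is an assumption of this example.

```python
BASE_US = 4.096  # base unit of the 4.096 us * 2^n encoding


def decode_us(exponent: int) -> float:
    """Decode a 4.096 us * 2^n exponent into microseconds."""
    return BASE_US * (1 << exponent)


def local_ack_timeout_exponent(path_time_us: float,
                               local_ack_delay_exponent: int) -> int:
    """Pick an exponent covering 2 * path time + the local CA's ACK delay."""
    required_us = 2 * path_time_us + decode_us(local_ack_delay_exponent)
    for exponent in range(32):  # 5-bit unsigned field
        if decode_us(exponent) >= required_us:
            return exponent
    raise ValueError("required timeout too long to encode")


# Example: 100 us one-way path time and an ACK delay exponent of 4 (65.536 us)
# need 265.536 us, which first fits at exponent 7 (524.288 us).
exponent = local_ack_timeout_exponent(100.0, 4)
```

Because each step doubles the timeout, the chosen value can be nearly twice the required time, which is the trade-off the second paragraph of the section describes.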

12.7.35 PrivateData

Data that is opaque to the communication management protocol, passed
from the sender to the recipient. The recipient may choose to accept or
reject the request based on the private data. The format and meaning of
the PrivateData field is specific to the ServiceID and message type, and
is not specified within Communication Management. See Volume 3,
Section 3.3.




12.7.36 Failover Accepted

Indicates whether the target of the REQ accepted or rejected the Alternate
port address contained in the REQ. By sending the REP, the target
accepts the connection request, but it may still reject the proposed failover
port.

If failover is accepted, each CM shall cause the associated QP (for
RC/UC) or EEC (for RD) specified by Local QPN to be placed in the
REARM Migration State (see section 17.2.8.1 Automatic Path Migration
Protocol).

If failover is rejected, each CM shall cause the associated QP or EEC to
be placed in the Migstate:MIGRATED state upon transition to the RTR
state.

12.7.37 Remote QPN/EECN

The remote (relative to the sender) QPN or EECN, as appropriate, that is
the subject of the message. Provides an additional check that the (Local
Communication ID, Remote Communication ID) pair references the
correct resource.

12.7.38 Retry Count

The total number of times that the sender wishes the receiver to retry
timeout, packet sequence, etc. errors before posting a completion error. See
sections 9.9.2.1.1 Requester Error Retry Counters and 9.9.2.4.1
Requester Class A Fault Behavior for details of how the retry counter works.

12.7.39 RNR Retry Count

The total number of times that the REQ or REP sender wishes the
receiver to retry RNR NAK errors before posting a completion error. See
sections 9.9.2.1.1 Requester Error Retry Counters and 9.9.2.4.1
Requester Class A Fault Behavior for details of how the RNR retry counter
works.



12.8 Alternate Path Management 



IBA supports Automatic Path Migration (see section 17.2.8 Automatic
Path Migration), in which a channel's traffic (RC, UC, RD) may be moved
to a pre-determined alternate path. The initial alternate path is
established at connection setup, but if a migration occurs, a new path needs to
be specified before re-enabling migration.

Two messages are specific to alternate path management. LAP - Load 
Alternate Path carries the new path information. APR - Alternate Path 
Response informs the requester of the status of the LAP request. 
The MRA message may be sent by the LAP recipient if it is unable to send
the APR message within the Remote CM Response Timeout. As the
LAP is idempotent, the message may be re-sent if there is no response, or if
the Service Timeout is not met. The recipient shall return a failure status
in the APR if the LAP request specifies an alternate path that is the same, 
in every respect, as the primary path. There is no limit on the number of 
LAP messages that a sender may have outstanding, but a sender shall 
have no more than one LAP outstanding per remote QP/EEC at any time. 

The QP/EEC state changes requested by the LAP and APR messages 
may be effected through the ModifyQP or ModifyEE verbs (sections 
11.2.3.2 and 11.2.6.2 ). 



Figure 121 Loading alternate path
(Ladder diagram showing the LAP, MRA, and APR messages.)



12.8.1 LAP - Load Alternate Path 



LAP is an optional message used to change the alternate path
information for a specific connection. It may be sent to update the alternate path
information if fabric changes cause it to become invalid, or to load the
"new" alternate path information after a path migration occurs. Loading
alternate path information does not initiate the migration process for
automatic failover; it just specifies which path is to be used when the path
migration occurs.
Table 97 LAP Message Contents

Field                          Description            Byte[Bit] Offset   Length, bits
Local Communication ID         See section 12.7.1.    0                  32
Remote Communication ID        See section 12.7.2.    4                  32
Local CM Q_Key                 See section 12.7.8.    8                  32
Remote QPN/EECN                See section 12.7.37.   12                 24
Remote CM Response Timeout     See section 12.7.4.    15                 5
(reserved)                                            15[5]              3
(reserved)                                            16                 32
Alternate Local Port LID       See section 12.7.11.   20                 16
Alternate Remote Port LID      See section 12.7.23.   22                 16
Alternate Local Port GID       See section 12.7.10.   24                 128
Alternate Remote Port GID      See section 12.7.22.   40                 128
Alternate Flow Label           See section 12.7.18.   56                 20
(reserved)                                            58[4]              4
Alternate Traffic Class        See section 12.7.17.   59                 8
Alternate Hop Limit            See section 12.7.19.   60                 8
Alternate Packet Rate          See section 12.7.25.   61                 8
Alternate SL                   See section 12.7.16.   62                 4
Alternate Subnet Local         See section 12.7.7.    62[4]              1
(reserved)                                            62[5]              3
Alternate Local ACK Timeout    See section 12.7.34.   63                 5
(reserved)                                            63[5]              3
Private Data                   See section 12.7.35.   64                 1344
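The byte and bit offsets in Table 97 can be turned into a small decoder. The sketch below is illustrative only: the `Lap` type and `parse_lap` are hypothetical names, multi-byte fields are assumed to be big-endian, and a bit offset such as 15[5] is assumed to count from the most significant bit of the byte. The Alternate Packet Rate byte is skipped because its row is not fully legible in this copy, and the GID fields are omitted for brevity.

```python
import struct
from dataclasses import dataclass

LAP_LENGTH = 64 + 1344 // 8  # fixed fields plus Private Data = 232 bytes


@dataclass
class Lap:
    local_comm_id: int
    remote_comm_id: int
    local_cm_qkey: int
    remote_qpn_eecn: int
    remote_cm_response_timeout: int
    alt_local_lid: int
    alt_remote_lid: int
    alt_flow_label: int
    alt_traffic_class: int
    alt_hop_limit: int
    alt_sl: int
    alt_subnet_local: int
    alt_local_ack_timeout: int
    private_data: bytes


def parse_lap(buf: bytes) -> Lap:
    """Decode the LAP fields at the offsets listed in Table 97."""
    if len(buf) != LAP_LENGTH:
        raise ValueError(f"expected {LAP_LENGTH} bytes, got {len(buf)}")
    local_id, remote_id, qkey, word3 = struct.unpack_from(">IIII", buf, 0)
    alt_local_lid, alt_remote_lid = struct.unpack_from(">HH", buf, 20)
    flow_word = int.from_bytes(buf[56:59], "big")  # 20-bit label + 4 reserved
    return Lap(
        local_comm_id=local_id,
        remote_comm_id=remote_id,
        local_cm_qkey=qkey,
        remote_qpn_eecn=word3 >> 8,                      # bytes 12..14
        remote_cm_response_timeout=(word3 & 0xFF) >> 3,  # top 5 bits of byte 15
        alt_local_lid=alt_local_lid,
        alt_remote_lid=alt_remote_lid,
        alt_flow_label=flow_word >> 4,
        alt_traffic_class=buf[59],
        alt_hop_limit=buf[60],
        alt_sl=buf[62] >> 4,                   # top 4 bits of byte 62
        alt_subnet_local=(buf[62] >> 3) & 1,   # bit 62[4]
        alt_local_ack_timeout=buf[63] >> 3,    # top 5 bits of byte 63
        private_data=bytes(buf[64:]),
    )
```

If the bit-numbering assumption is wrong for a given implementation, only the shift amounts change; the byte offsets come directly from the table.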



12.8.2 APR - Alternate Path Response 



APR is sent in response to a LAP request. MRA may be sent to allow 
processing of the LAP. 






Table 98 APR Message Contents

Field                          Description            Byte[Bit] Offset   Length, bits
Local Communication ID         See section 12.7.1.    0                  32
Remote Communication ID        See section 12.7.2.    4                  32
Additional Information                                8                  128
AP Status                      See section 12.8.2.1.  24                 4
(reserved)                                            24[4]              4
Private Data                   See section 12.7.35.   25                 1656



12.8.2.1 AP Status 



Value   Meaning
0       Alternate path information loaded
1       Invalid Communication Instance tuple
2       Alternate paths not supported
3       Alternate path information rejected
4       Alternate path information rejected - redirect
5       Proposed alternate path matches current primary path



If AP Status is "Alternate path information rejected - redirect", the Addi- 
tional Information field contains the GID of the port that the APR sender 
suggests to support the alternate path. The LAP sender may send a new 
LAP proposing the port specified. 
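The AP Status values and the redirect case can be sketched as an enumeration plus a small decoder. The names `ApStatus` and `parse_apr_status` are hypothetical, and reading AP Status from the upper four bits of byte 24 assumes that bit offsets in Table 98 count from the most significant bit.

```python
from enum import IntEnum
from typing import Optional, Tuple

APR_LENGTH = 25 + 1656 // 8  # fixed fields plus Private Data = 232 bytes


class ApStatus(IntEnum):
    LOADED = 0
    INVALID_COMM_INSTANCE = 1
    NOT_SUPPORTED = 2
    REJECTED = 3
    REJECTED_REDIRECT = 4
    MATCHES_PRIMARY = 5


def parse_apr_status(buf: bytes) -> Tuple[ApStatus, Optional[bytes]]:
    """Return the AP Status and, for a redirect, the suggested port GID."""
    if len(buf) != APR_LENGTH:
        raise ValueError(f"expected {APR_LENGTH} bytes, got {len(buf)}")
    status = ApStatus(buf[24] >> 4)  # upper nibble of byte 24 (assumed)
    # Additional Information (bytes 8..23) carries a GID only on redirect.
    gid = bytes(buf[8:24]) if status is ApStatus.REJECTED_REDIRECT else None
    return status, gid
```

On `REJECTED_REDIRECT`, a LAP sender could use the returned GID as the alternate port in a new LAP, as the paragraph above permits.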

12.9 State Transition Diagrams For Communication Establishment and Release 

The diagrams in this section detail all valid states and state transitions in 
the IBA communication establishment and release protocols. Section 
12.10 contains ladder diagrams which illustrate various paths through this 
state diagram. 
The InfiniBand™ communication establishment and communication
release protocols are structured so that they will always run to completion in
a bounded amount of time. "Completion" for the communication
establishment protocol means that the communication will either be established, or
else the state of all parties involved in the communication will revert to idle
as if no communication had ever been established. "Completion" for the
communication release protocol means that the communication is
released; this protocol never fails to run to completion.

12.9.1 Diagram Description

There is only one communication establishment protocol for InfiniBand™,
with different messages used for different scenarios. The state diagrams
are broken into an "active side" and a "passive side". The active side of
the protocol is the side that is trying to initiate a transition out of one of the
terminal states (Idle and Established). The passive side of the protocol is
the side that is responding to the active side.

12.9.2 Invalid State Input Handling

In many of the states of the InfiniBand™ communication establishment
and release protocols, there is a defined set of input messages that can
legally be received while in that state. The general rule for handling input
messages that cannot be legally received and acted upon while in that
state is to ignore them. A CM shall not retry the REQ, REP, or DREQ
messages more than the number of times specified by REQ:Max CM Retries.

12.9.3 Timeouts

A lost or dropped message will ultimately result in a timeout. Since all
parties will ultimately return to the idle state, there is no correctness
requirement to do retries of a message send as a result of a timeout, although it
is recommended. Senders of retried messages may not modify the
contents of the messages between retries.

In the following state diagrams, "Timeout" represents a Response
Timeout. Service Timeouts are specifically noted.

12.9.4 State Diagram Notes

All REJs, sent or received, cause a return to IDLE (active) or LISTEN
(passive), possibly through the TimeWait state (see section 12.9.8.4).

In the Active Communication Establishment diagram, the transition from
the Peer Compare state to the Passive REQ_Rcvd state only happens if
the ServiceID in the REQ received is the same as the ServiceID in the
REQ that was sent. (See section 12.10.4 for details.) Otherwise, a new
connection establishment instance shall be started.

The ServiceID implicitly defines whether the service is client/server or
peer to peer, but the server application must inform its CM so that the CM
will handle the inbound REQ correctly.
12.9.5 Communication Establishment and Release - Active

Figure 122 Communication Establishment (Active Side)

Figure 123 Communication Establishment (Passive Side)

12.9.7 State and Transition Definitions



The following tables define each state and the possible transitions from 
the state. 

These tables define the protocol, and take precedence in the case of a 
conflict with the state diagrams. 

In this table, "CEP" (Channel EndPoint) means QP or EEC, as appro- 
priate. 



12.9.7.1 Active States

CM State        Event                        Action/Transition Sequence

IDLE            Entry                        CEP to RESET
                Send REQ                     CEP to Initialized / Send REQ / CM to REQ Sent
                (default)                    None

REQ Sent        Receive REP                  CM to REP Rcvd / CEP to Ready to Receive
                Receive REQ                  IF (ServiceIDs match) to Peer Compare
                Receive MRA(REQ)             to REP wait
                Response Timeout             to Timeout
                Receive REJ                  CEP to Error / CM to IDLE
                (default)                    None

Peer Compare    Entry                        IF (local CA GUID higher than remote CA GUID)
                                                 to REQ Sent
                                             ELSE
                                                 to Passive: REQ Rcvd

REP wait        Receive REP                  CM to REP Rcvd / CEP to Ready to Receive
                Service Timeout              Send REJ / CEP to Error / CM to IDLE
                Receive REJ                  CEP to Error / CM to IDLE
                (default)                    None

REP Rcvd        Send RTU                     CM to Established / Send RTU / CEP to Ready to Send
                Send MRA(REP)                to MRA(REP) Sent
                Send REJ                     CEP to Error / CM to IDLE
                (default)                    None

MRA(REP) Sent   Send RTU                     CM to Established / Send RTU / CEP to Ready to Send
                Send REJ                     CEP to Error / CM to IDLE
                (default)                    None

Established     Receive DREQ                 CM to DREQ Rcvd
                Send DREQ                    CM to DREQ Sent
                Receive REQ                  See section 12.9.8.3.1
                (default)                    None

DREQ Sent       Timeout                      to DREP Timeout
                Receive DREQ                 CM to DREQ Rcvd
                Receive DREP                 to TimeWait
                (default)                    None

DREQ Rcvd       Send DREP                    to TimeWait
                (default)                    None

TimeWait        Entry                        CM: Start Timer / CEP to RESET
                Receive DREQ                 CM: Send DREP (if Max CM Retries not exceeded)
                Timer Expiration             CM to IDLE
                (default)                    None

Timeout         Entry (Retry)                Send REQ / to REQ Sent
                (Max Retries not exceeded)
                Entry (No Retry)             Send REJ / CEP to Error / CM to IDLE

DREP Timeout    Entry (Retry)                Send DREQ / to DREQ Sent
                (Max Retries not exceeded)
                Entry (No Retry)             to TimeWait
                (default)                    None
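The active-side table above lends itself to a lookup keyed by (state, event). The sketch below is a partial illustration: it records only the CM state changes, leaves out the CEP actions, the Peer Compare branch, and the entry actions of the Timeout states, and models "(default) None" as staying in the current state. The names `ACTIVE_TRANSITIONS` and `step` are hypothetical.

```python
# (CM state, event) -> next CM state, transcribed from the active-side table.
ACTIVE_TRANSITIONS = {
    ("IDLE", "Send REQ"): "REQ Sent",
    ("REQ Sent", "Receive REP"): "REP Rcvd",
    ("REQ Sent", "Receive MRA(REQ)"): "REP wait",
    ("REQ Sent", "Response Timeout"): "Timeout",
    ("REQ Sent", "Receive REJ"): "IDLE",
    ("REP wait", "Receive REP"): "REP Rcvd",
    ("REP wait", "Service Timeout"): "IDLE",
    ("REP wait", "Receive REJ"): "IDLE",
    ("REP Rcvd", "Send RTU"): "Established",
    ("REP Rcvd", "Send MRA(REP)"): "MRA(REP) Sent",
    ("REP Rcvd", "Send REJ"): "IDLE",
    ("MRA(REP) Sent", "Send RTU"): "Established",
    ("MRA(REP) Sent", "Send REJ"): "IDLE",
    ("Established", "Receive DREQ"): "DREQ Rcvd",
    ("Established", "Send DREQ"): "DREQ Sent",
    ("DREQ Sent", "Timeout"): "DREP Timeout",
    ("DREQ Sent", "Receive DREQ"): "DREQ Rcvd",
    ("DREQ Sent", "Receive DREP"): "TimeWait",
    ("DREQ Rcvd", "Send DREP"): "TimeWait",
    ("TimeWait", "Timer Expiration"): "IDLE",
}


def step(state: str, event: str) -> str:
    """Apply one event; inputs with no table entry are ignored (default: None)."""
    return ACTIVE_TRANSITIONS.get((state, event), state)


# Walk the simple success path: IDLE -> REQ Sent -> REP Rcvd -> Established.
state = "IDLE"
for event in ("Send REQ", "Receive REP", "Send RTU"):
    state = step(state, event)
```

Ignoring unlisted inputs matches the general rule in section 12.9.2 that messages which cannot legally be received in a state are ignored.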



12.9.7.2 Passive States

State           Event                            Action/Transition Sequence

LISTEN          Entry                            CEP to RESET
                Receive REQ                      CEP to Initialized / CM to REQ Rcvd
                (default)                        None

REQ Rcvd        Send REP                         CEP to Ready to Receive / Send REP / CM to REP Sent
                Send MRA(REQ)                    to MRA Sent
                Send REJ                         CEP to Error / CM to LISTEN
                (default)                        None

MRA Sent        Send REP                         CEP to Ready to Receive / Send REP / CM to REP Sent
                Send REJ                         CEP to Error / CM to LISTEN
                (default)                        None

REP Sent        Receive RTU                      CEP to Ready to Send / CM to Established
                Receive MRA(REP)                 to MRA(REP) Rcvd
                Receive message on service CEP   CEP to Ready to Send / CM to Established
                Receive REJ                      CEP to Error / CM to LISTEN
                Timeout                          to RTU Timeout
                (default)                        None

MRA(REP) Rcvd   Receive RTU                      CEP to Ready to Send / CM to Established
                Service Timeout                  Send REJ / CEP to Error / CM to TimeWait
                Receive message on service CEP   CEP to Ready to Send / CM to Established
                Receive REJ                      CEP to Error / CM to LISTEN
                (default)                        None

Established     Send DREQ                        CM to DREQ Sent
                Receive DREQ                     CM to DREQ Rcvd
                Receive REQ                      See section 12.9.8.3.1
                (default)                        None

DREQ Rcvd       Send DREP                        CM to TimeWait
                (default)                        None

DREQ Sent       Timeout                          CM to DREP Timeout
                Receive DREQ                     Send DREP / CM to TimeWait
                Receive DREP                     CM to TimeWait
                (default)                        None

RTU Timeout     Retry REP                        to REP Sent
                No Retry                         Send REJ / CEP to Error / CM to TimeWait
                (default)                        None

TimeWait        Entry                            CM: Start Timer / CEP to RESET
                Receive DREQ                     CM: Send DREP (if Max CM Retries not exceeded)
                Timer Expiration                 CM to IDLE
                (default)                        None

DREP Timeout    Entry (Retry)                    Send DREQ / to DREQ Sent
                (Max Retries not exceeded)
                Entry (No Retry)                 to TimeWait
                (default)                        None

12.9.8 State Details

12.9.8.1 Timeout

A message may be re-sent no more than REQ:Max CM Retries, but there
is no requirement that it be re-sent that many times.

12.9.8.2 RTU Timeout

If the Passive agent sends REP but does not receive either an RTU or a
message on the CEP (QP or EEC, as appropriate), it transitions to RTU
Timeout. If it has not exceeded REQ:Max CM Retries, the Passive agent
may resend REP.

12.9.8.3 Established

12.9.8.3.1 REQ Received

(RC, UC) If a REQ is received specifying a remote port/QPN considered
by the local CM to be connected to a local QPN, the local CM shall issue
REJ, then DREQ until DREP received or Max Retries exceeded. The
local QP shall be placed in TimeWait state.

(RD) If a REQ is received specifying a remote port/EECN considered by
the local CM to be connected to a local EECN, the local CM shall issue
REJ, then DREQ until DREP received or Max Retries exceeded. Local
EEC shall be placed in TimeWait state.

12.9.8.4 TimeWait

The PathRecord:PacketLifeTime (section 15.2.5.16 PathRecord) field
defines the maximum time that a packet can exist in the fabric.

The TimeWait timer shall be set to twice the PathRecord:PacketLifeTime
value plus the remote's Ack Delay.

The CM is responsible for placing QPs/EECs in the TimeWait state, for
maintaining them in that state for a period not less than the TimeWait
period, and for removing them afterward.

Receipt of a DREQ while in the TimeWait state shall not affect the
TimeWait timer.
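The TimeWait rule (twice the PacketLifeTime plus the remote's Ack Delay, unaffected by a late DREQ) can be sketched as below. The class and function names are hypothetical, and both inputs are assumed to have been converted to microseconds already.

```python
from dataclasses import dataclass


def time_wait_us(packet_life_time_us: float, remote_ack_delay_us: float) -> float:
    """TimeWait period: 2 * PacketLifeTime + the remote's Ack Delay."""
    return 2 * packet_life_time_us + remote_ack_delay_us


@dataclass
class TimeWaitEntry:
    """Tracks one QP/EEC that has been placed in the TimeWait state."""
    entered_at_us: float
    duration_us: float

    def on_dreq(self) -> None:
        # A DREQ received during TimeWait must not restart or extend the timer,
        # so the entry is deliberately left unchanged here.
        pass

    def expired(self, now_us: float) -> bool:
        return now_us - self.entered_at_us >= self.duration_us


entry = TimeWaitEntry(entered_at_us=1000.0,
                      duration_us=time_wait_us(500.0, 65.536))
entry.on_dreq()
```

The CM would keep the QP/EEC out of service until `expired` returns true, which gives the "not less than the TimeWait period" behavior the text requires.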



12.9.8.5 Message Receive Acknowledgment (MRA) 

Figure 124 MRA Example
(Ladder diagram between A and B showing the REQ and MRA(REQ) messages
and the Local CM Response Timeout, Remote CM Response Timeout,
Remote CM Service Timeout, and Local CM Service Timeout intervals.)



Figure 124 illustrates the use of the MRA message in a CM message ex- 
change. 'Local' and 'Remote' are with respect to 'A'. Because B cannot 
return a REP or REJ to A within the Remote CM Response Timeout, B 
sends an MRA(REQ), notifying A that B has received the REQ message 
and is processing it. The MRA(REQ) contains B's CM Service Timeout 
value. B completes its processing and sends the REP message to A be- 
fore the expiration of the Remote CM Service Timeout. 

Because packet flight times may differ due to fabric congestion (e.g., the
MRA may travel in the minimum possible time, and the REP in the
maximum time, as shown by Tmin and Tmax), A shall allow an additional
PacketLifeTime for the REP to arrive.
When A receives the REP, it realizes that the required processing will not
allow it to transmit a REJ or RTU soon enough to arrive at B before the
Local CM Response Timeout expires, so it sends an MRA(REP)
containing its CM Service Timeout value. When it completes the REP
processing, A sends the RTU, which arrives before the Local CM Service
Timeout expires.

Once an MRA is received, the CM shall not re-send the message
acknowledged by the MRA sooner than the period of time represented by
the applicable CM Service Timeout period plus PacketLifeTime.

12.9.8.6 Timeouts and Retries

In the communication establishment protocol, the sending of the REQ,
REP and DREQ messages may be retried by the sender. The retry
happens after the sender fails to receive a response message from the
recipient within the appropriate response timeout period.

For the REQ message, the Remote CM Response Timeout period is the
recipient's "turn-around" time. The REQ sender may consider the REQ
(or response) lost after (2*PacketLifeTime + Remote CM Response
Timeout). Upon receiving a REQ, the recipient must send a REP, REJ, or
MRA by the Remote CM Response Timeout. The Service Timeout period
begins when the MRA is sent, and a REJ or REP must be sent before it
expires.

The Local CM Response Timeout tells the REP sender how long to wait
for an MRA, REJ, or RTU. The Local CM Response Timeout value
includes the round trip flight time. If the REP sender receives an MRA, it
can expect the REJ or RTU within (Local CM Service Timeout +
PacketLifeTime) after the MRA's arrival.

The response timeout period for the DREQ message is the Local CM
Response Timeout present in the original REQ message.

When the sender retries a message send, the recipient can potentially
receive multiple copies of the same message. The recipient of a REQ (or
REP) message should determine the amount of time it has to send a
response based upon when it received the latest REQ (or REP) message;
the remaining time it has to reply is thus reset back to the full response
timeout period each time it receives a new REQ (or REP) for the same
connection establishment attempt.

If the sender of a REP message receives another REQ message for the
same connection establishment attempt, after it resends the REP
message it should reset its response timeout period back to the full Local
Response Timeout period that it received in the REQ message.
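The retry timing rules above reduce to a few sums. The following sketch uses hypothetical function names and assumes all times are plain microsecond values on a common clock; it shows the rules as stated rather than any particular CM implementation.

```python
def req_lost_deadline_us(sent_at_us: float,
                         packet_life_time_us: float,
                         remote_cm_response_timeout_us: float) -> float:
    """Time after which the REQ sender may consider the REQ (or response) lost."""
    return sent_at_us + 2 * packet_life_time_us + remote_cm_response_timeout_us


def earliest_resend_after_mra_us(mra_received_at_us: float,
                                 service_timeout_us: float,
                                 packet_life_time_us: float) -> float:
    """Once an MRA arrives, do not re-send before Service Timeout + PacketLifeTime."""
    return mra_received_at_us + service_timeout_us + packet_life_time_us


def may_retry(retries_so_far: int, max_cm_retries: int) -> bool:
    """REQ, REP and DREQ may be re-sent at most Max CM Retries times."""
    return retries_so_far < max_cm_retries


# Example: REQ sent at t=0 with a 500 us PacketLifeTime and a 4194.304 us
# Remote CM Response Timeout may be declared lost at t=5194.304 us.
deadline = req_lost_deadline_us(0.0, 500.0, 4194.304)
```

A recipient that sees a duplicate REQ or REP would simply restart its own response window from the arrival time of the latest copy, as the closing paragraphs describe.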
12.9.9 Connection State

Communication Managers shall maintain the following information for the
life of a connection:

• Local Communication ID

• Remote Communication ID

• Local CM Response Timeout

• Remote CM Response Timeout

• Local QPN / Local EECN

• Remote QPN / Remote EECN

• Remote LID / Remote GID

• Max CM Retries

12.10 Communication Establishment Ladder Diagrams

The following ladder diagrams show the message exchanges for various
communication establishment scenarios. These are not applicable to
Unreliable Datagrams (see section 12.11 for Service ID Resolution).

19 

12.10.1 Active Client to Passive Server - Both Client and Server Accept Communication

Figure 125 Active/Passive, Both Accept
(Ladder diagram between the Active and Passive sides.)

For RC and UC service, the above exchange establishes the connection.
For RD service, the above exchange must be performed:

• To establish a Reliable Datagram Channel between two EECs

• To resolve a Service ID and associate QPs for possible use over
an existing RDC

How cooperating applications exchange information on additional avail- 
able QPs is specific to the applications. 



12.10.2 Active Client to Passive Server - Server Rejects Communication 



Figure 126 Active/Passive, Server Reject
(Ladder diagram between the Active and Passive sides showing the REQ
and REJ messages.)



The above exchange occurs when the passive server cannot or will not
perform the requested action. The REJ message contains the reason
why the action was not performed.

12.10.3 Active Client to Passive Server - Client Rejects Communication

[Ladder diagram. Participants: Active, Passive. Messages shown: REQ, REP, REJ.]

Figure 127 Active/Passive, Client Reject



The above exchange occurs when the requesting client decides not to
continue with the requested action. (An example is a client that requires
Automatic Path Migration support not provided by the server.) The REJ
message contains the reason why the action was not continued.

12.10.4 Peer to Peer - Both Accept Communication

[Ladder diagram. Participants: Peer A, Peer B. Messages shown: REQ, REQ, REP, RTU. Annotation: B loses comparison and takes passive role.]

Figure 128 Active/Active, Both Accept



The above exchange occurs when two peer entities attempt communication.
In this case, the ServiceID in both REQ messages is the same. The
ServiceID implicitly defines whether the service is client/server or peer to
peer, but the application must inform the CM so that the CM will handle
the inbound REQ correctly. (For instance, if a client on A wishes to
establish a connection to a server on B at the same time a client on B wishes
to establish a connection to the same service on A, each CM must know
that, because the services are client/server, there are two active/passive
connection instances in progress, and not a single active/active instance.)

Peer A and Peer B compare their CA (Channel Adapter) GUIDs, treating
each as a big-endian value, to decide which party will take the active side
of the CM protocol. The peer with the numerically smaller GUID assumes
the passive role in the remainder of the communication establishment
protocol.

If the CA GUIDs match (e.g., two processes using the same CA), the 
REQ: Local QPN fields shall be compared, treating each as a big-endian 
value, with the smaller QPN taking the passive role. 
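
The role comparison described above can be sketched as follows. This is a minimal illustration, not a normative algorithm: the function name and the representation of the GUID and QPN as raw byte strings are assumptions of the example.

```python
def passive_side(guid_a: bytes, qpn_a: bytes, guid_b: bytes, qpn_b: bytes) -> str:
    """Return "A" or "B", naming the peer that takes the passive role.

    Each argument is the raw field as carried on the wire and is compared
    as a big-endian unsigned value.
    """
    ga = int.from_bytes(guid_a, "big")
    gb = int.from_bytes(guid_b, "big")
    if ga != gb:
        # The numerically smaller CA GUID takes the passive role.
        return "A" if ga < gb else "B"

    # Same CA on both sides: fall back to the REQ:Local QPN fields.
    qa = int.from_bytes(qpn_a, "big")
    qb = int.from_bytes(qpn_b, "big")
    if qa == qb:
        raise ValueError("identical CA GUID and QPN: the comparison cannot decide")
    return "A" if qa < qb else "B"
```

The peer that is not returned takes the active side for the remainder of the exchange.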

CM shall not be used to establish "loopback" channels on a single QP. 
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12.10.5 Active Peer to Active Peer - Passive Rejects Communication 



[Ladder diagram. Participants: Peer A, Peer B. Messages shown: REQ, REQ, REJ. Annotation: B loses comparison and takes passive role.]

Figure 129 Active/Active, Passive Reject



The above exchange occurs when the 'losing' peer decides not to
continue the requested action. The REJ message contains the reason the
action was not continued.

12.10.6 Active Peer to Active Peer - Active Rejects Communication

[Ladder diagram. Participants: Peer A, Peer B. Messages shown: REQ, REQ, REP, REJ. Annotation: B loses comparison and takes passive role.]

Figure 130 Active/Active, Active Reject



The above exchange occurs when the 'winning' peer decides
not to continue with the requested action. The REJ message contains the
reason why the action was not continued.

12.10.7 Active Client to Passive Server with Redirector - All Accept Communication

[Ladder diagram. Participants: Client, Redirector, Server. Messages shown: REQ, REJ, REQ, REP, RTU, DATA. Annotation: (Not visible to client).]

Figure 131 Redirection, Accepted



A redirector is a CM that provides CM services on behalf of an entity
supported on a port other than the redirector CM's port. The port information
for both endpoints is explicit in the REQ and REP messages, allowing the
redirector to manage connections as a proxy for another entity.

The above exchange occurs when the redirector sends REJ with the
Status "Port Redirection", indicating that the requested ServiceID is
available at a different port. The requesting client (using a new Local
Communication ID) sends a new REQ proposing the port specified in the REJ. All
CM messages are exchanged between the client CM and the redirector,
but traffic over the established connection goes between the client and the
server.

12.10.8 Communication Release



The following ladder diagram shows the message exchange for commu- 
nication release. 

Communication release as illustrated in this section is ungraceful. Upon
receipt of a Disconnect Request, each CM shall cause the affected QP to
be placed into the error state, causing pending work requests to complete
with the Flush error status.

Consumers are free to define and execute a more graceful communication
release protocol that allows for an orderly shutdown of communications.
Any such protocol shall utilize the communication release protocol
illustrated below after the termination of normal message processing.



12.10.8.1 Disconnect Request 



[Ladder diagram. Participant label shown: B. Messages shown: DREQ, DREP.]

Figure 132 Disconnect Request



Because the DREQ and DREP travel out of band relative to normal com- 
munications traffic, how operations currently in progress will be completed 
cannot be predicted. 



12.11 Service ID Resolution Protocol 



Service ID Resolution (SIDR) provides a way for users of Unreliable Dat- 
agram (UD) service to determine a Queue Pair on the target port that sup- 
ports a given Service ID. 

GSAs shall support this protocol if non-management services are
provided on the Channel Adapter at other than fixed QPNs. If this protocol is
not supported, the status "The method/attribute combination is not
supported" shall be returned, as described in Table 104 MAD
Common Status Field Bit Values.


The protocol consists of a single request and a single reply, using
unreliable Management Datagrams (MADs) targeted to the GSI. SIDR
messages are of the Communication Management class.


If the SIDR response returns a valid QPN, the returned QPN shall be in
the partition identified by the P_Key in the header of the SIDR Request.


12.11.1 SIDR_REQ - Service ID Resolution Request

SIDR_REQ requests that the recipient return the information necessary to
communicate via UD messages with the entity specified by
SIDR_REQ:ServiceID.

Table 99 SIDR_REQ Message Contents

Field          Description              Byte[Bit] Offset   Size, bits (Values)
RequestID      See section 12.11.1.1    0                  32
(reserved)                              4                  32
ServiceID      See section 12.11.1.2    8                  64
Private Data   See section 12.11.1.3    16                 1728

12.11.1.1 RequestID

A 32-bit value identifying the request. The target of the request shall
return this value unchanged in the SIDR_REP message. If a SIDR_REQ
message is re-sent, the sender shall send the same RequestID.

12.11.1.2 Service ID

The Service ID that the sender wishes to have resolved. See section 
12.7.3 for more information. 

12.11.1.3 Private Data

Data that is opaque to the SIDR protocol for use by the requester and
responder. For example, some systems may require that the PrivateData
area contain an authorization key before reporting the QP supporting
certain ServiceIDs.
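
To make the SIDR_REQ layout in Table 99 concrete, the following sketch packs the four fields into a byte string. It is illustrative only: the function and constant names are hypothetical, and it assumes the message occupies 232 bytes, which is the sum of the field sizes in Table 99 (1856 bits), with all fields in big-endian order.

```python
import struct

# RequestID (32 bits), reserved (32 bits), ServiceID (64 bits),
# PrivateData (1728 bits = 216 bytes); big-endian, no padding.
SIDR_REQ_FMT = ">IIQ216s"
PRIVATE_DATA_LEN = 216


def pack_sidr_req(request_id: int, service_id: int, private_data: bytes = b"") -> bytes:
    """Build the SIDR_REQ message contents (hypothetical helper)."""
    if len(private_data) > PRIVATE_DATA_LEN:
        raise ValueError("PrivateData is limited to 216 bytes")
    # The reserved field is sent as zero; struct pads PrivateData with zeros.
    return struct.pack(SIDR_REQ_FMT, request_id, 0, service_id, private_data)
```

A re-sent request would call the helper again with the same RequestID, as required in 12.11.1.1.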

12.11.2 SIDR_REP - Service ID Resolution Response

SIDR_REP returns the information necessary to communicate via UD
messages with the entity specified by SIDR_REQ:ServiceID.

Table 100 SIDR_REP Message Contents

Field          Description              Byte[Bit] Offset   Length, bits
RequestID      See section 12.11.1.1    0                  32
QPN            See section 12.11.2.2    4                  24
Status         See section 12.11.2.1    7                  8
ServiceID      See section 12.11.1.2    8                  64
Q_Key          See section 12.11.2.3    16                 32
Private Data   See section 12.11.1.3    20                 1696

12.11.2.1 Status

The Status field tells whether the QPN field is valid, and if not valid, the
reason a valid QPN was not provided.

0      QPN is valid
1      Service ID not supported
2      Rejected by Service Provider
3      No QP available
4-255  Reserved

12.11.2.2 QPN

The QPN of the local QP on which the requested Service ID is supported.
(Only valid if so indicated by Status field).

12.11.2.3 Q_Key

The Q_Key for the QP returned in QPN.

12.11.3 Path Information

The information returned in the SIDR_REP message is insufficient, by
itself, to create a usable address handle. Specifically, the values for
PathRecord:Mtu and PathRecord:Rate are required except when sending
packet payloads no larger than the minimum PMTU, or when transmitting
on a minimum-width link, respectively. These values are available
through Subnet Administration (see section 15.2.5.16).
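
As a companion illustration for the SIDR_REP layout in Table 100, the sketch below unpacks a response and interprets the Status field. The names are hypothetical, and the 232-byte length is an assumption derived from the field sizes in Table 100 (1856 bits). Note that a valid QPN and Q_Key obtained this way still need the path information described above before an address handle can be built.

```python
import struct
from typing import NamedTuple

SIDR_REP_LEN = 232  # assumed: sum of the Table 100 field sizes, in bytes

STATUS_TEXT = {
    0: "QPN is valid",
    1: "Service ID not supported",
    2: "Rejected by Service Provider",
    3: "No QP available",
}


class SidrRep(NamedTuple):
    request_id: int
    qpn: int
    status: int
    service_id: int
    q_key: int
    private_data: bytes

    @property
    def qpn_valid(self) -> bool:
        return self.status == 0


def status_text(status: int) -> str:
    # Values 4-255 are reserved.
    return STATUS_TEXT.get(status, "Reserved")


def parse_sidr_rep(buf: bytes) -> SidrRep:
    """Split SIDR_REP contents into fields (hypothetical helper)."""
    if len(buf) != SIDR_REP_LEN:
        raise ValueError("expected 232 bytes of SIDR_REP contents")
    # Bytes 4-6 carry the 24-bit QPN and byte 7 carries Status, so the
    # second 32-bit word holds both.
    request_id, qpn_status, service_id, q_key = struct.unpack_from(">IIQI", buf, 0)
    return SidrRep(request_id, qpn_status >> 8, qpn_status & 0xFF,
                   service_id, q_key, bytes(buf[20:]))
```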



Chapter 13: Management Model



13.1 Introduction 



IBA management is built on top of four fundamental concepts. These
include:

• Management entities.
• Agents.
• A messaging scheme.
• A collection of specific messages including message content, and
related behaviors.

An agent is a conceptualization of a body of low level functionality
embedded in all channel adapters, switches, and routers, which provides the
means to set and query various parameters internal to the channel
adapter, switch, or router.

Managers and interested parties are conceptualizations of high level
bodies of functionality which provide for controlling or examining various
aspects of subnet or fabric configuration and operation.

The messaging scheme provides for intercommunication between managers
or interested parties and agents, and, in some cases, between
agents. The messaging scheme specifies the basic message types and
interfaces through which agents and managers exchange information.

Finally, specific messages and message sequences are defined in terms
of message content and associated required behaviors. Messages are
grouped into classes according to the type of management activity the
messages support.

The specification of management operations is done from the viewpoint
of specifying messages that may appear on the wire and specifying
behaviors associated with those messages. The appearance of a message
at a port implies a required action and, possibly, response. Additionally,
the appearance of a message on the wire implies behavior of the entity
that caused the message to be emitted. In particular, the behavior
requirements in certain areas (e.g. subnet management, see 14.4 Subnet
Manager on page 655) imply the existence of certain entities (e.g. a subnet
manager) which embody required behaviors with respect to the
origination and consumption of various messages.

Various conceptualizations are used in specifying behaviors. However,
the use of such conceptualizations in this and other management related
chapters is purely a descriptive artifice. The conceptualizations
themselves do not convey normative requirements. Normative requirement
specification is done by, and only by, specification of message formats
and associated required behaviors. Finally, while some conceptualizations
may suggest certain implementations, implementations are outside
of the scope of the specification and no specific implementation is implied.

13.2 Assumptions, and Scope

13.2.1 Assumptions

There are certain assumptions that underlie the management mechanisms
specified herein. Proper operation of the management mechanisms
and fulfillment of the objectives underlying these specifications is
predicated upon the validity of these assumptions. While the assumptions
themselves are not part of the specification, they are an essential element
of the framework in which these specifications apply.

• The management operations specified herein provide for a level of
interoperability such that an SM from any vendor can manage a
heterogeneous collection of IBA-compliant channel adapters, switches,
and routers from any set of vendors. However, compatibility and
interoperability among SMs from different vendors is not supported.
Migration from one vendor's SM to another's by way of system
re-initialization, i.e., through a planned outage, is supported. Such
migration assumes appropriate steps of transferring data between
vendors' SMs have been accomplished prior to the re-initialization.



• The management operations specified herein provide the means to
conduct a variety of activities. Some of the mechanisms specified are
optional. And, except as specifically stated, the specification of a
means does not imply or require that the means be used. It is assumed
that each fabric will be constructed, configured, and operated
according to the needs of its user(s) and that constructors exercise
diligence in selection of components to ensure the fabric possesses
the characteristics required. For example, if a user requires multicast
support but mixes components that do and do not support multicast,
they may fail to achieve their requirements.

As noted above, a number of management classes are distinguished in 
the IBA management model. The classes include: 

• Subnet management: Subnet management is the body of activity
associated with discovering, initializing, and maintaining an IBA
subnet. In addition, the subnet management sections specify
methods for interfacing to a diagnostic framework for handling
subnet and protocol errors. (See 14.2.5.14 VendorDiag on page
648). In the following sections a subnet manager will be denoted
by SM while a subnet management agent will be denoted by
SMA.

• Subnet administration (SA): Subnet administration provides a
means for management entities and applications to obtain
information about fabric configuration and operation. (See Chapter
15: Subnet Administration on page 670). In the following sections
subnet administration will be denoted by SA.



• Communication management: Communication management
provides the means to set up and manage communications between
a pair of queue pairs or, in certain cases, to identify which queue
pair to use for a certain service. (See Chapter 12: Communication
Management on page 516 and 16.7 Communication Management
on page 786). In the following sections a communications
manager will be denoted by CM while a communications
management agent will be denoted by CMA.

• Performance management: Performance management specifies
a set of facilities for examining various performance characteristics
of a fabric. (See 16.1 Performance Management on page
717). In the following sections a performance manager will be
denoted by PM while a performance management agent will be
denoted by PMA.

• Device management: Device management specifies the means
for determining the kind and location of various kinds of devices
on a fabric. (See 16.3 Device Management on page 761). In the
following sections a device manager will be denoted by DM while
a device management agent will be denoted by DMA.

• Baseboard management: Baseboard management specifies the
means to effect, in-band (i.e. over the IBA fabric), low level system
management operations. (See 16.2 Baseboard Management on
page 749 and VOLUME 2, Chapter 13, Hardware Management).
In the following sections a baseboard manager will be denoted by
BM while a baseboard management agent will be denoted by
BMA.



• SNMP tunneling: SNMP tunneling specifies mechanisms to
support transport of SNMP operations through an IBA fabric. (See
16.4 SNMP Tunneling on page 772). In the following sections an
SNMP tunneling agent will be denoted by SNMPA.

• Vendor specific: The vendor specific classes specify a basic
framework within which a vendor can define vendor specific
management communications and operations that are beyond the
scope of the IBA. See 16.5 Vendor-specific on page 781 for
architectural details and Volume 3 of the IBA Specification for
recommended practices relating to use of specific values.

• Application specific: The application specific classes specify a
basic framework within which services can be defined which
implement operations that are beyond the scope of the IBA. The
intended use of the application specific classes is to support best
practices defined in Volume 3 of the IBA Specification. See 16.6
Application-specific on page 783 for architectural details and
Volume 3 of the IBA Specification for recommended practices
relating to use of specific values.

As a notational convenience, the set of classes as listed above but
excluding subnet management are referred to as General Services. When
referring to general services managers, the notation GSM may be used.
When referring to general services agents, the notation GSA may be
used. According to the context in which it appears, GSM(s) or GSA(s) may
refer to the group of all supported general services managers or agents
on a channel adapter, switch, or router, or to any of the managers or
agents in that group.

The IBA management services provided by the above classes support
management of only the devices that comprise the IBA subnet. They do
not support management of devices beyond the subnet. Specifically, they
do not support management of tape drives, hard disk drives, network
interfaces, etc. Mechanisms required to discover and power-manage
devices that are accessed through an IBA channel adapter are provided
within the above classes but provision of services specific to such devices
is beyond the scope of the IBA.

IBA management provides a means of configuring and gathering
information from IBA channel adapters, switches, and routers. The IBA Subnet
Administration Service provides a means for other entities to determine
the topology and configuration of the subnet. For example, operating
systems or other higher level management entities may use IBA Subnet
Administration services mechanisms to enforce operating system policies, or
cluster policies, and so on, but such higher level entities and the policies
they effect are outside the scope of IBA management services.

A variety of standards for communication of management information
between managed elements and management applications exist today.
These include SNMP, DMI, and CIM (Simple Network Management
Protocol, Desktop Management Interface, and Common Information Model),
as well as other standard and proprietary interfaces. Such standards may
be layered on top of the IBA management model interfacing to it through
services defined in the model. Alternatively, they may interface to IBA
management elements through private interfaces. In either case, while
the IBA management model provides means for such applications to
obtain subnet topology and configuration information, such applications are
outside the scope of IBA management.

Finally, the current IBA specification defines only the mechanisms
required for proper operation of IBA fabrics and interoperation of IBA
components. Specific applications, such as for enclosure management, can
also be used in conjunction with non-IBA subsystems that are connected 
to the IBA subnet. Such applications may utilize the IBA subnet as a 
means of transport for the specific subsystem management data but such 
subsystem management services themselves are outside of the scope of 
the management services specified for IBA. 

Figure 133 Management Model on page 568, below, depicts an example
subnet indicating graphically the relationships among the IBA managed
subnet and related services and higher level and lower level entities that
may be found on an IBA fabric.



[Figure: example subnet. Labels shown: Host Based Endnode; Managed Subnet; Management Consoles and/or higher level management applications; IBA Subnet Administration (SA); IBA Subnet Manager (SM); I/O Endnode; End Node; Controller; sensor, led, vpd; IB-M; to peer subnet.]

Figure 133 Management Model

This chapter provides an overview of the IBA management model, the 
management entities, and the corresponding interfaces. In addition this 
chapter defines requirements and specifies mechanisms common to all 
management activities. Subsequent chapters specify additional details 
associated with specific management classes. For each management
class, the complete set of applicable requirements that must be satisfied
and mechanisms that must be provided is the combination of those from
this chapter with those in the corresponding class specific sections of the
other chapters.

13.3 Managers, Agents, and Interfaces

13.3.1 Introduction

IBA Management is organized around abstract functional entities referred
to as managers and agents, and interfaces. Communication between
managers and agents is performed through management messages
referred to as Management Datagrams (MADs). MADs are exchanged
using the unreliable datagram transport service as defined in 9.8.3
Unreliable Datagrams on page 329.

Managers are conceptual functional entities that effect control over fabric
elements or provide for gathering information from fabric elements. In
general, managers may reside anywhere in a subnet although class
specific constraints on the manner in which they logically interface to the
fabric medium (e.g. SMs use QP0, see 13.5.1 MAD Interfaces on page
600) may impose specific restrictions.

Agents are conceptual functional entities present in IBA channel adapters,
switches, and routers that process management messages arriving at the
ports of the IBA channel adapters, routers, and switches where they
exist. The functionality represented by an agent effects required behaviors
associated with MADs which arrive at the port or ports with which it is
associated.

Abstractly, interfaces represent a target to which messages may be sent
and through which messages will be processed or will be dispatched to an
appropriate processing entity. For management interfaces, the associated
processing entity is an agent or, in some cases, a manager. As such, an
interface is a means to gain access to the functionality of agents and/or
managers.

Management operations are divided into a set of classes. For a given
class of activity, there is usually only a small number of managers on a
subnet. Conceptually, for each supported class, there is one agent on
each switch, channel adapter, and router on the IBA subnet.

Although the notions of agent, manager, and interface as described
above may suggest specific implementations, this specification only
mandates behavior with respect to sourcing and sinking management
messages, not how that behavior is achieved. The notion of an agent,
manager, or interface is a convenient descriptive artifice which
encapsulates functional operations and behaviors associated with a particular
class of activities. This specification does not require the existence of
agents, managers, or interfaces per se. It does require that
implementations exhibit the behaviors associated with the abstract agents, managers,
and interfaces. How that is actually accomplished is implementation
dependent.

The messages and behaviors relating to the subnet management class
are further defined in 14.2 Subnet Management Class on page 611. This
class uses specialized MADs referred to as Subnet Management Packets
(SMPs).

Figure 134 depicts a single subnet showing representative relationships
among channel adapters, switches, subnet managers and agents.




[Figure: single subnet with ports. Note in figure: Any switch, channel adapter, or router may host a Subnet Manager. There may be multiple Subnet Managers in a subnet, one master and several standby. Legend entries: "Denotes an Endnode."; "Denotes a switch."; "Denotes subnet man-" (truncated in source).]

Figure 134 A Single Subnet Depicting Representative
Subnet Manager/Agent Relationships

The messages and behaviors relating to the subnet administration class
are further defined in 15.2 SA MADs on page 671 while the messages and
behaviors relating to the other general services classes are further
defined in the subsections of Chapter 16: General Services on page 717.
These service classes use MADs referred to as General Services
Management Packets (GMPs).

Figure 135 depicts a single subnet showing representative relationships
between general service class managers and corresponding agents.




[Figure: single subnet with ports and GS* entities. Note in figure: GS* is an abstraction. The Mgr/Agent will be one of: Subnet Administration; Performance mgmt; Comm mgmt; SNMP tunneling; Device mgmt, Device configuration. Legend entries: "Denotes an Endnode."; "Denotes a switch."; "Denotes general services manager."]

Figure 135 A Single Subnet Depicting Representative
General Services Management/Agent Relationship

13.3.2 Required Managers and Agents 

C13-1: Each subnet shall have at least one logical SM. 

Logical SMs may be single physical entities or may consist of multiple, possibly
distributed, cooperating physical entities which collectively effect the appearance
of a single SM to CAs, switches, and routers on the subnet it manages.

If there is more than one entity capable of acting as a master SM, only one
should function as a master SM during initialization.

See Chapter 14: Subnet Management on page 610 for additional specific
requirements applicable to SMs during and after initialization.

There is a close relationship between SMs and SAs. This is described in
15.1.2 Relationship Between SA and the SM on page 671.



IBA version 1.0 does not otherwise mandate the existence of, the location
of, or operational characteristics of GSMs. The class specific sections of
Chapter 16: General Services on page 717 define messages and agent
behaviors available that GSMs depend on but there are no manager
specific messages or related behaviors that GSMs must support.

Every IBA compliant channel adapter, switch, or router should support the
functionality characterized as an SMA. Supporting the functionality
characterized as an SMA means that the channel adapter, switch, or router
sources and sinks messages and effects related behavior as specified in
the corresponding class specific section of Chapter 16: General Services
on page 717. The specific requirements for supporting this functionality at
the various ports of the device are specified in the chapter covering the
specific type of device. See Chapter 17: Channel Adapters on page 790,
Chapter 18: Switches on page 813 and Chapter 19: Routers on page 830.



Every IBA compliant channel adapter, switch, or router should support the
functionality characterized as the various GSAs for those general services
specified to be mandatory in the class specific section of Chapter 16:
General Services on page 717. Supporting the functionality characterized as
a GSA means that the channel adapter, switch, or router sources and
sinks messages and effects related behavior as specified in the
corresponding class specific section of Chapter 16: General Services on page
717. The specific requirements for supporting this functionality at the
various ports of the device are specified in the chapter covering the specific
type of device. See Chapter 17: Channel Adapters on page 790, Chapter
18: Switches on page 813 and Chapter 19: Routers on page 830.

13.4 Management Datagrams

Management Datagrams (MADs) are the basic elements of the messaging
scheme defined for management communications. MADs are
classified into predefined management classes and for each MAD there
is a specified format, use, and behavior. This section specifies
characteristics, i.e. formats and associated behaviors, common to all MADs or
common across multiple classes. MADs specific to a class are specified
in class specific sections of Chapter 14: on page 610, Chapter 15: on page
670, and Chapter 16: on page 717.

13.4.1 Conventions

C13-2: For all MADs, for both the fields in the MAD header as well as the
fields in MAD attributes, bit placement follows the conventions specified
in 1.6 Document Conventions on page 37. In addition, the following
conventions shall be observed.

• Fields within a MAD may be either fixed length or variable length
within a fixed length location. A variable length field placed in a
fixed length location is placed in the high order bits of the fixed
length location and the remainder of that location is filled with zero.

• Reserved fields must be filled with 0 by the requester and ignored
by the receiver.

• When constructing a response MAD that contains all or part of
the corresponding request MAD, it is acceptable to include the
contents of reserved fields in the request MAD in the response
MAD without regard to their content. That is, such fields need not
be set to zero in the response MAD.

• In attribute descriptions in subsequent sections, fields specified
as read only (RO) are not alterable by means of MADs. The
mechanisms for setting such fields are implementation dependent
and outside of the scope of the IBA. With respect to MADs
that set values, recipients shall ignore any bits in the attribute in a
request that correspond to RO components of the attribute being
set.

• In attribute descriptions in subsequent sections, fields specified
as read write (RW) are settable by means of MADs.
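
Two of the conventions above lend themselves to a short worked example: placing a variable length field in the high order bits of a fixed length location, and ignoring read only bits when a value is set. The sketch below is illustrative only; the function names and the use of a bit mask to mark RO components are assumptions of the example, not mechanisms defined by the specification.

```python
def place_variable_field(value: int, value_bits: int, location_bits: int) -> int:
    """Place a value_bits-wide field in the high order bits of a location.

    The remainder of the location (the low order bits) is filled with zero.
    """
    if value_bits > location_bits:
        raise ValueError("field does not fit in the location")
    if value < 0 or value >> value_bits:
        raise ValueError("value does not fit in value_bits")
    return value << (location_bits - value_bits)


def apply_set(current: int, requested: int, ro_mask: int) -> int:
    """Apply a set request while ignoring bits that correspond to RO components.

    ro_mask has a 1 in every bit position that is read only (assumed encoding).
    """
    return (current & ro_mask) | (requested & ~ro_mask)
```

For instance, a 3-bit value 0b101 placed in an 8-bit location yields 0b10100000, and a set request of 0xFF against a current value of 0xAB with the upper four bits read only yields 0xAF.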

C13-3: The data payload (as used in Chapter 9: Transport Layer on page
196) for all MADs shall be exactly 256 bytes.

C13-4: The data payload shall include, and only include, the items defined
in the MAD base format in Figure 136 MAD Base Format on page 575,
with semantics as described in Table 101 Common MAD Fields on page
575.

All MADs consist of a MAD header and MAD data. Except as noted, the
MAD header definition is the same for all MADs. The contents of MAD
data areas vary by management class and the specific attribute within the
class.

Figure 136 MAD Base Format

bytes
0      BaseVersion | MgmtClass | ClassVersion | R, Method
4      Status | ClassSpecific
8      TransactionID
12     (TransactionID, continued)
16     AttributeID | Reserved
20     AttributeModifier
24     Data
...
252    (Data, continued)



13.4.3 Management Datagram Fields 

Table 101 Common MAD Fields on page 575 lists fields that are common 
to all MADs. Each class may specify additional class specific usage for 
certain of these fields. 

Table 101 Common MAD Fields

Field Name         Length  Offset  Description
                   (bits)  (bits)
-----------------  ------  ------  ---------------------------------------------
BaseVersion        8       0       Version of MAD base format. This shall be 1.

MgmtClass          8       8       Class of operation. See Table 102 Management
                                   Class Values on page 576 for definition and
                                   use.

ClassVersion       8       16      Version of MAD class-specific format. This
                                   shall be 1, except for the Vendor class where
                                   it shall be 1 or greater subject to vendor
                                   versioning.

R                  1       24      Response bit. See 13.4.5 Management Class
                                   Methods on page 577 for definition and usage.

Method             7       25      Method to perform based on the management
                                   class. See 13.4.5 Management Class Methods on
                                   page 577 for definition and usage.

Status             16      32      Code indicating status of operation. See
                                   13.4.7 Status Field on page 587 for
                                   definition and usage.

ClassSpecific      16      48      This field is reserved except for the Subnet
                                   Management class. See 14.2.1.2 SMP Data
                                   Format - Directed Route on page 612 for
                                   definition and usage for Subnet Management.

TransactionID      64      64      Transaction identifier. See 13.4.6.4
                                   TransactionID usage on page 586. This field,
                                   if unused by the management class, shall be
                                   set to 0.

AttributeID        16      128     Defines objects being operated on by a
                                   management class. This field, if unused,
                                   shall be set to 0. See 13.4.8 Management
                                   Class Attributes on page 587 as well as class
                                   specific sections of Chapter 14: on page 610,
                                   Chapter 15: on page 670, and Chapter 16: on
                                   page 717 for definition and usage.

Reserved           16      144

AttributeModifier  32      160     Provides further scope to the attributes.
                                   Usage is determined by the management class
                                   and attribute. This field, when not used for
                                   the combination of management class and
                                   attribute specified in the header, shall be
                                   set to 0.

Data               1856    192     The data area, usage is defined within the
                                   scope of the management class.
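
As an informal illustration (not part of the specification), the field lengths and offsets in Table 101 can be expressed as a fixed binary layout. The sketch below assumes big-endian (network) byte order and that the R bit is the most significant bit of the fourth byte; the names `pack_mad` and `unpack_mad` are hypothetical helpers introduced only for this example.

```python
import struct

# 24-byte common header from Table 101:
# BaseVersion, MgmtClass, ClassVersion, R+Method (1 byte each),
# Status, ClassSpecific (2 bytes each), TransactionID (8 bytes),
# AttributeID, Reserved (2 bytes each), AttributeModifier (4 bytes).
# Big-endian byte order is an assumption of this sketch.
MAD_HEADER = struct.Struct(">BBBBHHQHHI")
MAD_SIZE = 256                           # C13-3: exactly 256 bytes
DATA_SIZE = MAD_SIZE - MAD_HEADER.size   # 232 bytes = 1856 bits


def pack_mad(mgmt_class, method, attribute_id, attribute_modifier=0,
             transaction_id=0, status=0, class_specific=0, data=b"",
             base_version=1, class_version=1):
    """Build a 256-byte MAD. `method` is the 8-bit value including the R bit."""
    if len(data) > DATA_SIZE:
        raise ValueError("data area is limited to %d bytes" % DATA_SIZE)
    header = MAD_HEADER.pack(base_version, mgmt_class, class_version, method,
                             status, class_specific, transaction_id,
                             attribute_id, 0, attribute_modifier)
    # Unused trailing bytes of the data area are zero filled.
    return header + data.ljust(DATA_SIZE, b"\x00")


def unpack_mad(mad):
    """Split a 256-byte MAD into its header fields and data area."""
    if len(mad) != MAD_SIZE:
        raise ValueError("a MAD is exactly %d bytes" % MAD_SIZE)
    (base_version, mgmt_class, class_version, method, status, class_specific,
     transaction_id, attribute_id, _reserved, attribute_modifier) = \
        MAD_HEADER.unpack_from(mad)
    return {
        "BaseVersion": base_version,
        "MgmtClass": mgmt_class,
        "ClassVersion": class_version,
        "R": method >> 7,            # bit offset 24, length 1
        "Method": method & 0x7F,     # bit offset 25, length 7
        "Status": status,
        "ClassSpecific": class_specific,
        "TransactionID": transaction_id,
        "AttributeID": attribute_id,
        "AttributeModifier": attribute_modifier,
        "Data": mad[MAD_HEADER.size:],
    }
```

The header size (24 bytes) plus the 1856-bit data area gives the 256 bytes required by C13-3, which is a useful consistency check on the table.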



13.4.4 Management Classes 



C13-5: MADHeader:MgmtClass shall be one of the values defined in
Table 102 Management Class Values on page 576 not defined as reserved.

The functionality provided by specific classes is specified in Chapter 14:
Subnet Management on page 610, Chapter 15: Subnet Administration on
page 670, and Chapter 16: General Services on page 717.

Table 102 Management Class Values

Management   Value       Description                   Required Support for     Reference Section
Class                                                  Class
-----------  ----------  ----------------------------  -----------------------  -------------------------
Subn         0x01        Subnet Management class       All channel adapters,    14.2 Subnet Management
                         (LID routed)                  switches, and routers.   Class on page 611

Subn         0x81        Subnet Management class       All channel adapters,    14.2 Subnet Management
                         (Directed route)              switches, and routers.   Class on page 611

SubnAdm      0x03        Subnet Administration class   All channel adapters,    15.2 SA MADs on page 671
                                                       switches, or routers,
                                                       hosting a subnet
                                                       manager

Perf         0x04        Performance Management        All channel adapters,    16.1 Performance
                         class                         switches, and routers.   Management on page 717

BM           0x05        Baseboard Management class    All channel adapters,    16.2 Baseboard
                         (tunneling of IB-ML           switches, and routers.   Management on page 749.
                         commands through the IBA
                         subnet)

DevMgt       0x06        Device Management class       Optional.                16.3 Device Management
                                                                                on page 761

CommMgt      0x07        Communication Management      All channel adapters.    16.7 Communication
                         class                                                  Management on page 786

SNMP         0x08        SNMP Tunneling class          Optional                 16.4 SNMP Tunneling on
                         (tunneling of the SNMP                                 page 772
                         protocol through the IBA
                         fabric)

Vendor       0x09-0x0F   Vendor Specific classes       Optional                 16.5 Vendor-specific on
                                                                                page 781

Application  0x10-0x1F   Application Specific classes  Optional                 16.6 Application-specific
                                                                                on page 783

             0x00        Reserved
             0x20-0x80
             0x82-0xFF







With respect to the column labeled Required Support for Class, an indica- 
tion that support is required indicates that at least some aspects of the 
class must be supported. Complete details of which aspects are manda- 
tory and which aspects are optional are specified in the corresponding ref- 
erence section. 
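
As an informal illustration, the value assignments in Table 102 can be captured in a small lookup. The function name `mgmt_class_name` is hypothetical. The sketch returns `None` for any value that Table 102 marks as reserved or does not list at all (as extracted here, the table does not mention 0x02).

```python
# Individually assigned management class values from Table 102.
MGMT_CLASSES = {
    0x01: "Subn (LID routed)",
    0x81: "Subn (Directed route)",
    0x03: "SubnAdm",
    0x04: "Perf",
    0x05: "BM",
    0x06: "DevMgt",
    0x07: "CommMgt",
    0x08: "SNMP",
}


def mgmt_class_name(value):
    """Map an 8-bit MgmtClass value to its Table 102 name.

    Returns None when the value is reserved or not listed in Table 102.
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("MgmtClass is an 8-bit field")
    if value in MGMT_CLASSES:
        return MGMT_CLASSES[value]
    if 0x09 <= value <= 0x0F:
        return "Vendor"
    if 0x10 <= value <= 0x1F:
        return "Application"
    return None
```

A receiver applying C13-5 could, under these assumptions, treat a `None` result as a MgmtClass value that is not permitted.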



13.4.5 Management Class Methods 



Methods define the operations that a management class supports. In ad- 
dition to supporting methods common to multiple classes, each manage- 
ment class may define additional class specific methods. 

The upper bit of the Method field is designated as the response bit (R). It
is used to distinguish three types of messages based upon the type of
method included in the header as follows:

• Message methods are methods for which no response is ever generated.
The R bit is not set (i.e. it is 0) and the corresponding method
with the R bit set is reserved and not used.

• Request methods are methods for which a response may be generated.
The R bit is not set (i.e. it is 0) and the corresponding method with
the R bit set is defined and potentially used to convey a response.

• Response methods are methods generated in response to receipt of
a request method. The R bit is set (i.e. it is 1) and the corresponding
method with the R bit not set is defined and used to trigger (request)
the response.

See section 13.4.6.4 TransactionID usage on page 586 for required re- 
quest/response behavior. 

C13-6: The method names and method values shown in Table 103
Common Management Methods on page 578 shall be used in a manner
consistent with the descriptions contained in this subsection (13.4.4
Management Classes on page 576).

C13-7: The values assigned to the common methods shall not be used for
any class-dependent method even if the common method is not supported.

C13-8: Class specific methods defined to be requests and responses
shall conform to the request response definitions in this section, the
request response requirements specified in Section 13.4.6.4 TransactionID
usage on page 586, and shall use the R bit according to the semantics of
types of methods defined above.



Table 103 Common Management Methods

Name           Type      Value              Description
                         (including R bit)
-------------  --------  -----------------  ------------------------------------------------
Get()          Request   0x01               Request (read) an attribute from a channel
                                            adapter, switch, or router. See 13.4.6.1.1
                                            Get/GetResp on page 580.

Set()          Request   0x02               Request a set (write) of an attribute in a
                                            channel adapter, switch, or router. See
                                            13.4.6.1.2 Set/GetResp on page 580.

GetResp()      Response  0x81               The response from an attribute Get or Set
                                            request. See 13.4.6.1.1 Get/GetResp on page 580
                                            and 13.4.6.1.2 Set/GetResp on page 580.

Send()         Message   0x03               Send a datagram. Does not require a response.
                                            See 13.4.6.1.3 Send on page 581.

Trap()         Message   0x05               An unsolicited datagram sent from a channel
                                            adapter, switch, or router indicating an event
                                            occurred that may be of interest. See
                                            13.4.6.1.4 Trap on page 582 and 13.4.9 Traps on
                                            page 594.

Report()       Request   0x06               Used to forward an event/trap/notice to
                                            interested party. See 13.4.6.1.6
                                            Report/ReportResp on page 582 and 13.4.11 Event
                                            Forwarding on page 597.

ReportResp()   Response  0x86               Response to a Report(). See 13.4.6.1.6
                                            Report/ReportResp on page 582 and 13.4.11 Event
                                            Forwarding on page 597.

TrapRepress()  Message   0x07               Instruct a trap sender to cease sending a
                                            repeated trap. See 13.4.6.1.5 TrapRepress on
                                            page 582 and 13.4.9 Traps on page 594 for usage.

                         0x00, 0x04,        Reserved.
                         0x08-0x0F, 0x80,
                         0x82-0x85,
                         0x87-0x8F

                         0x10-0x7F,         Class-specific methods. Use is defined by the
                         0x90-0xFF          class.



For Get(), Set(), GetResp(), and Send() methods, the combinations of
method and attribute that are valid are class specific and are specified in
the respective class sections in Chapter 14: on page 610, Chapter 15: on
page 670, and Chapter 16: on page 717.

For Trap(), TrapRepress(), Report(), and ReportResp() methods,
attribute usage is specified in Sections 13.4.9 Traps on page 594 and
13.4.11 Event Forwarding on page 597.
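
As an informal illustration, the method values of Table 103 can be classified with a lookup plus two range checks. The function name `describe_method` is hypothetical. Note that the sketch uses a table rather than deriving a response value from a request value, because Set() (0x02) is answered by GetResp() (0x81) and 0x82 is reserved.

```python
# Common methods from Table 103: 8-bit value (including R bit) -> (name, type).
COMMON_METHODS = {
    0x01: ("Get", "Request"),
    0x02: ("Set", "Request"),
    0x81: ("GetResp", "Response"),
    0x03: ("Send", "Message"),
    0x05: ("Trap", "Message"),
    0x06: ("Report", "Request"),
    0x86: ("ReportResp", "Response"),
    0x07: ("TrapRepress", "Message"),
}


def describe_method(value):
    """Classify an 8-bit method value (R bit in the most significant bit)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("R plus Method together occupy 8 bits")
    if value in COMMON_METHODS:
        return COMMON_METHODS[value]
    if (value & 0x7F) >= 0x10:
        # 0x10-0x7F and 0x90-0xFF are class-specific. With R clear the
        # method is a request or a message; the class definition decides.
        kind = "Response" if value & 0x80 else "Request or Message"
        return ("class-specific", kind)
    # Everything else in 0x00-0x0F and 0x80-0x8F is reserved.
    return ("reserved", None)
```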

13.4.6 Management Messaging 

13.4.6.1 Methods and Message Sequencing 

For each MAD received by a responder, a given method may require a re- 
sponse to be returned. 

Responders generate responses as appropriate and required for each
request MAD received as specified in sections 13.4.6.1.1 Get/GetResp on
page 580, 13.4.6.1.2 Set/GetResp on page 580, and 13.4.6.1.6
Report/ReportResp on page 582.



C13-9: Responders shall not coalesce responses.

The subsequent ladder diagrams illustrate management request / response
behavior for valid MADs. The operations defined below assume
the receipt of a valid MAD. A MAD is valid if it satisfies all applicable
validation checks as specified in Section 13.5.3 MAD Validation on page 605.

13.4.6.1.1 Get/GetResp

Get() requests the read of an attribute from a channel adapter, switch, or
router.

C13-10: In response to a valid Get(), the responder shall generate a
GetResp() which consists of one or more response MADs - the number is a
function of class-specific requirements for the requested attribute.

The attribute contained in the GetResp() is determined according to the
specific MADHeader:MgmtClass and MADHeader:AttributeID in the
request.

[Ladder diagram between Manager and Node showing Get() followed by
GetResp(). Responder responds to initiator with current attribute
contents.]

Figure 137 Get

13.4.6.1.2 Set/GetResp

Set() informs the recipient to set values maintained by the recipient
according to the values contained in the attribute conveyed in
MADHeader:Data.

C13-11: In response to a valid Set(), the responder shall generate a
GetResp() which consists of one or more response MADs - the number is a
function of class-specific requirements for the requested attribute. [1]

C13-12: The attribute contained in the GetResp() shall be the same as in
the associated request except as specified in C13-43. The components
of the attribute set shall be set equal to the equivalent values maintained
by the recipient after the set has been performed except where otherwise
stated in class specific sections.

[1] Exception: A Set() of PortInfo:PortPhysicalState to Disabled does not
require that a GetResp() be sent from the port which has been Disabled.

[Ladder diagram between Manager and Node showing Set() followed by
GetResp(). Responder sets an attribute, then responds to initiator with
current attribute contents.]

Figure 138 Set

13.4.6.1.3 Send

Send() sends data from one entity to another on a class specific basis. If 
the class specific operations require reliability on top of the unreliable da- 
tagram service, higher level protocols may be defined based upon ex- 
changes of send type MADs. Such higher level protocols are class 
specific and are defined either in the class specific sections or other sec- 
tions referred to therein. 



[Ladder diagram between Manager/Node and Manager/Node showing Send().]

Figure 139 Send

13.4.6.1.4 Trap

Trap() indicates an event occurred at a channel adapter, switch, or router.
See Section 13.4.9 Traps on page 594 for the specification of trap usage
and behavior.

[Ladder diagram between Manager and Node showing Trap().]

Figure 140 Trap

13.4.6.1.5 TrapRepress

TrapRepress instructs a trap sender to cease sending a trap it is currently
sending. See Section 13.4.9 Traps on page 594 for a complete specification
of traps and TrapRepress.

The intended usage of TrapRepress() is shown below.

[Ladder diagram between Manager and Node: one or more instances of a
given trap sent by a channel adapter, switch, or router, followed by
TrapRepress().]

Figure 141 Trap

13.4.6.1.6 Report/ReportResp

Report() and ReportResp() MADs are used to forward traps directed to a
GSM to parties who have subscribed for trap forwarding. The forwarding
service is not available for subnet management class traps. See Section
13.4.11 Event Forwarding on page 597 for the complete specification of
the event forwarding mechanism.

[Ladder diagram with participants Interested Host, Class Manager, and
Endnode, showing Trap(Notice), Report(Notice), and ReportResp().]

Figure 142 Forwarding traps/notices from the class manager

13.4.6.2 Timers and Timeouts

A management entity may use the IBA-defined management timeout and
response time values to bound the amount of time a requester waits for a
response or to bound the amount of time between successive MADs in a
multiple MAD request, response, or message.



13.4.6.2.1 PortInfo:SubnetTimeout

PortInfo:SubnetTimeout specifies the maximum expected propagation
delay, which depends upon the configuration of the switches, to reach any
other port in the subnet from the port with which this instance of PortInfo
is associated. Requestors may use this value along with the appropriate
RespTimeValue (below), to determine how long to wait for a response to
a request before taking other action.

The duration of time is calculated as

4.096 microseconds * 2^PortInfo:SubnetTimeout.

Traps are subject to maximum trap rate injection constraints based upon
PortInfo:SubnetTimeout. See 13.4.9 Traps on page 594 for the usage of
PortInfo:SubnetTimeout with respect to traps.

13.4.6.2.2 RespTimeValue

The IBA defined RespTimeValue specifies the expected maximum time
interval between reception of an MAD and transmission of the associated
response or between the associated port's transmission of successive
MADs that are part of a multiple MAD sequence. Requestors may use this
value along with the appropriate SubnetTimeout (above), to determine
how long to wait for a response to a request, or, how long to wait for a
succeeding MAD in a multi MAD sequence, as described in Section 13.4.6.3
Timeout/Timer Usage on page 585. The duration of time is calculated as

4.096 microseconds * 2^RespTimeValue.

C13-13: The default RespTimeValue shall be 8.

C13-14: The RespTimeValue applicable to a given situation depends
upon the operation being performed and the MAD sequences involved.
The appropriate RespTimeValue shall be determined as follows:

• If MADHeader:MgmtClass is Subn or Directed Route Subn the applicable
RespTimeValue is conveyed by PortInfo:RespTimeValue (see
14.2.5.6 PortInfo on page 633 for the definition of PortInfo) of the
relevant port as identified below.

• If MADHeader:MgmtClass is any other than Subn or Directed Route
Subn, and the MADHeader:Method is not ReptResp(), the applicable
RespTimeValue is conveyed by ClassPortInfo:RespTimeValue (see
13.4.8.1 ClassPortInfo on page 589 for the definition of ClassPortInfo)
of the relevant port as identified below.

• If the MADHeader:Method is ReptResp(), the applicable RespTimeValue
is conveyed by InformInfo:RespTimeValue (see Section
13.4.8.3 InformInfo on page 592 for the definition of InformInfo)
specified by an event subscriber at the time of subscription.

C13-15: In the case of MAD sequences other than Report(), ReptResp(),
the port used to determine the applicable RespTimeValue shall be
determined as follows:

• For MAD request-response exchanges consisting of a single packet
request followed by a single packet response, the applicable
RespTimeValue associated with the responding port indicates the expected
maximum interval between receipt of the request at that port and
initiation of transmission of the corresponding response.

• For MAD request-response exchanges including a multipacket request
sequence followed by a response, the applicable RespTimeValue
associated with the sending port indicates the expected
maximum interval between initiation of transmission of successive
packets in the multipacket request sequence. The applicable
RespTimeValue associated with the receiving port indicates the expected
maximum interval between receipt of the last packet of the multipacket
request sequence at that port and initiation of transmission of the
corresponding response.

• For MAD request-response exchanges including a multipacket response,
the applicable RespTimeValue associated with the responding
port indicates the expected maximum interval between receipt of
the last packet of the request at that port and initiation of transmission
of the response. The applicable RespTimeValue associated with the
responding port also indicates the expected maximum interval between
the initiation of transmission of successive packets in the
multipacket response sequence.

• For MAD request-response exchanges using the windowing protocol
defined in 15.3.3 Reliable Multi-packet Protocol Description on page
701, the applicable RespTimeValue associated with the port originating
the request indicates the expected maximum time within which
the requester will request more packets following the last packet in a
burst of response packets.

• For operations requiring transmission of a sequence of multiple
MADs not classified as requests (e.g. a succession of Send()s conveying
fragments of an SNMP frame), the applicable RespTimeValue
associated with the port sending the sequence indicates the expected
maximum interval between initiation of transmission of successive
packets in the sequence.

Send(), Trap(), or TrapRepress() do not have an associated response
MAD (Send() MADs exchanged as part of a higher level protocol are not
request/response sequences in this context). As such, the IBA-defined
management timeout and response times are not applicable. Note that
while TrapRepress() may be sent as a result of the sending of a trap,
Trap() and TrapRepress() are classified as messages not as requests or
responses and do not constitute a request/response sequence.

13.4.6.3 Timeout/Timer Usage

In general, the expected maximum time interval between transmission of
a request and receipt of the associated response is

2*PortInfo:SubnetTimeout + RespTimeValue

where RespTimeValue is determined according to Section 13.4.6.2.2
RespTimeValue on page 583.

In general, the expected maximum time interval between reception of
successive MADs in a multi MAD transfer is

PortInfo:SubnetTimeout + RespTimeValue

where RespTimeValue is determined according to Section 13.4.6.2.2
RespTimeValue on page 583.
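
As an informal illustration, the two intervals above can be computed from the encoded values. The sketch assumes that each term in the expressions denotes the duration derived from the corresponding encoded value (4.096 microseconds * 2^value), and that both encodings fit in 5 bits (Table 106 gives that width for RespTimeValue; the same width for SubnetTimeout is an assumption). All function names are hypothetical.

```python
BASE_US = 4.096  # microseconds, from 13.4.6.2.1 and 13.4.6.2.2
DEFAULT_RESP_TIME_VALUE = 8  # C13-13


def duration_us(encoded):
    """Duration in microseconds for an encoded timeout value."""
    if not 0 <= encoded <= 31:  # assumed 5-bit encoding
        raise ValueError("encoded timeout must be in 0..31")
    return BASE_US * (2 ** encoded)


def response_timeout_us(subnet_timeout, resp_time_value=DEFAULT_RESP_TIME_VALUE):
    """Expected maximum interval between a request and its response."""
    return 2 * duration_us(subnet_timeout) + duration_us(resp_time_value)


def inter_mad_timeout_us(subnet_timeout, resp_time_value=DEFAULT_RESP_TIME_VALUE):
    """Expected maximum interval between successive MADs in a multi MAD transfer."""
    return duration_us(subnet_timeout) + duration_us(resp_time_value)
```

With the default RespTimeValue of 8, the response-time term alone is 4.096 * 256 = 1048.576 microseconds, so for small SubnetTimeout values the responder's processing time dominates the bound.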

If either expected maximum time interval is exceeded, the recipient of the
MADs may consider the entire sequence invalid. The exact behavior in
this case may be class-specific and possibly attribute-specific, but the
recipient can always reclaim the resources used by the part of the sequence
already received, and discard any MADs in the sequence that arrive later.
In the case of the reliable transport protocol (15.3.3 Reliable Multi-packet
Protocol Description on page 701) retransmission of a subsequence may
be requested.

C13-16: For request/response sequences, timers shall be started for each
request transmitted and reset upon arrival of the corresponding response
MAD.

C13-17: For multi MAD sequences, timers shall be reset then started
upon arrival of each successive MAD in the sequence.

13.4.6.4 TransactionID usage



The contents of the TransactionID (TID) field are implementation-depen- 
dent. 

C13-18: When initiating a new operation, MADHeader:TransactionID shall be
set to such a value that within that MAD the combination of TID, SGID,
and MgmtClass is different from that of any other currently executing
operation. Repeated Trap messages for the same event may be regarded
as continuing a 'currently executing' operation as long as the trap can be
repeated and no corresponding TrapRepress has been received, or they
may be regarded as initiating new operations.

C13-19: Note that the above implies that recipients of messages shall use
the combination of TID, SGID, and MgmtClass to uniquely associate
messages or message sequences, not just the TID.

C13-20: When constructing a request that consists of sequence of MADs,
requesters shall set MADHeader:TransactionID in each MAD that is part
of the sequence to an identical value.

C13-21: When constructing a response, responders shall set
MADHeader:TransactionID in the response equal to
MADHeader:TransactionID in the corresponding request.

C13-22: Where a response is made up of multiple MADs,
MADHeader:TransactionID in each MAD in the response shall be set equal to
MADHeader:TransactionID in the corresponding request.

C13-23: Where an operation defined for an IBA management class
requires a sender to send a succession of MADs of type message to effect
the operation (e.g. an SNMP PDU being tunneled through IBA), the
sender shall set MADHeader:TransactionID in each MAD that is part of
the sequence to an identical value.
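
As an informal illustration of C13-18 and C13-19, outstanding operations can be keyed by the (TID, SGID, MgmtClass) triple rather than by the TID alone. The class `TransactionTracker` and its methods are hypothetical, and the sequential TID allocation is only one possible choice, since the contents of the TID field are implementation-dependent.

```python
import itertools


class TransactionTracker:
    """Tracks currently executing operations by (TID, SGID, MgmtClass)."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._active = set()

    def begin(self, sgid, mgmt_class):
        """Allocate a TID whose triple differs from every active operation."""
        while True:
            tid = next(self._counter) & 0xFFFFFFFFFFFFFFFF  # 64-bit field
            key = (tid, sgid, mgmt_class)
            if key not in self._active:
                self._active.add(key)
                return tid

    def matches(self, tid, sgid, mgmt_class):
        """True if a received MAD belongs to an active operation."""
        return (tid, sgid, mgmt_class) in self._active

    def finish(self, tid, sgid, mgmt_class):
        """Release the triple once the operation has completed."""
        self._active.discard((tid, sgid, mgmt_class))
```

Every MAD of a multi MAD request or response carries the same TID (C13-20 to C13-23), so a single entry covers the whole sequence.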
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13.4.7 Status Field i 

All MADs contain a status field. The status field is used in MADs of type
response to convey information about the disposition of the request or
conditions associated with disposition of the request.

The status field consists of 16 bits. The eight low order bits of the field are
used for indications common to all classes. The eight high order bits of the
field are used for class specific indications. Class specific status
indications are defined in the class specific sections of Chapter 14: Subnet
Management on page 610, Chapter 15: Subnet Administration on page 670,
and Chapter 16: General Services on page 717.

C13-24: For messages of type Response (see 13.4.5 Management Class
Methods on page 577), the usage of the low order 8 bits shall be set as
specified in Table 104 MAD Common Status Field Bit Values on page 587.

Table 104 MAD Common Status Field Bit Values

Bit   Name               Meaning
----  -----------------  ----------------------------------------------------
0     Busy               Temporarily busy. MAD discarded. This is not an
                         error.

1     Redirect_required  Redirection. This is not an error.

2-4   Code for invalid   0 - no invalid fields
      field              1 - The class version specified is not supported.
                         2 - The method specified is not supported
                         3 - The method/attribute combination is not supported
                         4-6: Reserved
                         7 - One or more fields in the attribute contain an
                             invalid value.

5-7   Reserved.

8-15  Class Specific     The use of these bits is class specific.

C13-25: For messages of type Request or type Message (see Section
13.4.5 Management Class Methods on page 577), the entire status field
shall be set to 0.
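
As an informal illustration, the common status bits of Table 104 can be decoded with masks and shifts. The sketch assumes that bit 0 is the least significant bit of the 16-bit field, consistent with the statement that the eight low order bits carry the common indications. The function name `decode_status` is hypothetical.

```python
# Meanings of the 3-bit "code for invalid field" (bits 2-4) from Table 104.
INVALID_FIELD_CODES = {
    0: "no invalid fields",
    1: "class version not supported",
    2: "method not supported",
    3: "method/attribute combination not supported",
    7: "one or more attribute fields contain an invalid value",
}


def decode_status(status):
    """Decode a 16-bit MAD status field (bit 0 assumed least significant)."""
    if not 0 <= status <= 0xFFFF:
        raise ValueError("status is a 16-bit field")
    code = (status >> 2) & 0x7
    return {
        "busy": bool(status & 0x0001),
        "redirect_required": bool(status & 0x0002),
        "invalid_field_code": code,
        # Codes 4-6 are reserved, so they have no entry in the table above.
        "invalid_field_meaning": INVALID_FIELD_CODES.get(code, "reserved"),
        "class_specific": (status >> 8) & 0xFF,
    }
```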

13.4.8 Management Class Attributes

Attributes define the data which a management class manipulates. Each
management class defines its own set of attributes.

Attributes are composite structures consisting of components typically
representing hardware registers in channel adapters, switches, or routers.
Each attribute is assigned a unique Attribute ID.
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Depending upon the attribute, components may be read only, read/write, 
or reserved. 

Some attributes have associated Attribute Modifiers (AMs) which further 
qualify or modify the application of the attribute. The use of the AM is at- 
tribute-specific and usage is defined where the attribute is defined. 

C13-26: When the AM is not used it shall be set to all zeroes. 

It is not possible to selectively set a single component within an attribute. 
A Get() must be performed to obtain the whole attribute, the single com- 
ponent must be modified in the result and a Set() must be performed to 
write the whole attribute. No atomicity is implied or provided in this se- 
quence of operations. 

C13-27: A given attribute shall have the same format for the Get(), Set()
and GetResp() methods if used with those methods.

There are three attributes which are common across multiple classes.
Table 105 Attributes Common to Multiple Classes on page 588 lists each
such attribute, its ID, and the classes where it is used. Attribute IDs less
than 0x10 identify common attributes or are reserved. Attribute IDs equal
to or greater than 0x10 identify attributes whose definitions are class
specific. The structure and content of the common attributes is defined in the
following subsections. The structure and content of class specific
attributes are defined in the respective class specific sections of Chapter 14:
Subnet Management on page 610, Chapter 15: Subnet Administration on
page 670, and Chapter 16: General Services on page 717.

The following common attributes are defined: 

Table 105 Attributes Common to Multiple Classes

Attribute Name  Attribute ID   Attribute    Description            Where Used
                               Modifier
--------------  -------------  -----------  ---------------------  --------------------------
                0x0000                      Reserved

ClassPortInfo   0x0001         0x00000000   General and port-      The SA class and all
                                            specific information   supported GS classes on
                                            for a GS management    channel adapters,
                                            class                  switches, and routers
                                                                   (See 13.4.8.1
                                                                   ClassPortInfo on page 589).

Notice          0x0002         0x00000000-  Information regarding  All classes supporting
                               0xFFFFFFFF   the associated Notice  traps/notices. (See
                                            (or Trap in which      13.4.8.2 Notice on
                                            case the Attribute     page 591).
                                            Modifier shall be 0)

InformInfo      0x0003         0x00000000   Event Subscription     All classes having a
                                                                   class manager supporting
                                                                   event subscription. (See
                                                                   13.4.8.3 InformInfo on
                                                                   page 592).

                0x0004-0x000F               Reserved

                0x0010-0xFFFF               Class-dependent        Usage of values in this
                                            values.                range is class specific
                                                                   and is specified in the
                                                                   class specific sections of
                                                                   Chapter 14: on page 610,
                                                                   Chapter 15: on page 670,
                                                                   and Chapter 16: on
                                                                   page 717.

13.4.8.1 ClassPortInfo 

C13-28: Channel adapters, switches, and routers implementing a management
class shall implement ClassPortInfo according to the definition
specified in Table 106 ClassPortInfo on page 590.

C13-29: The ClassPortInfo attribute shall be implemented for every GS
class supported by a channel adapter, switch, or router.

C13-30: The ClassPortInfo attribute shall be implemented for the SA class
by any channel adapter, switch, or router on which an SA is located.

The presence of ClassPortInfo for a management class confirms the
availability of that management class on a particular channel adapter,
switch, or router and provides information about the version of MADs
supported by the class on that channel adapter, switch, or router. (Note:
support for a given class can be determined directly from capability bits in
PortInfo for the port in question. See 14.2.5.6 PortInfo on page 633).

The ClassPortInfo attribute also provides port-specific information for
class services on a channel adapter, switch, or router. In addition to being
available as the object of a Get() method specifying it as the target,
ClassPortInfo is also returned as the result of any Get() or Set() if the
requester is being redirected as described in Section 13.5.2 GSI Redirection
on page 603.

ClassPortInfo contains information related to general services traps. If
sending trap messages is supported on a channel adapter, switch, or
router, and if trap sending is enabled for this port (nonzero TrapLID),
ClassPortInfo defines the destination to which traps for the subject GS
class applying to this port are to be sent. See Section 13.4.9 Traps on
page 594. Note that this applies only to general services traps. Subnet
management traps do not use this mechanism.

For both redirection and traps, ClassPortInfo provides support for
cross-subnet communications by including the information necessary to build a
properly formed GRH, see Sections 13.4.9 Traps on page 594 and 13.5.2
GSI Redirection on page 603.

Table 106 ClassPortlnfo 



Component 


Access 


Length 


Offset 
fbits) 


Description 


BaseVersion 


RO 


8 


0 


Current supported MAD Base Version. Indicates that this channel 
adapter, switch, or router supports up to and including this version. 


ClassVersion 


RO 


8 


8 


Current supported pnanagernent class version. Indicates that this chan- 
nel adapter, switch, or router supports up to and including this version. 


CapabilityMask 


RO 


16 


16 


Supported capabilities of this management class, bit set to 1 for affirma- 
tion of management support. 

Bit 0 - If 1 , the management class generates Trap() MADs 

Bit 1 - If 1 , the management class implements Get(Notice) and 

Set(Notice) 

Bit 2-7: reserved 

Bit 8-15: class-specific capabilities. 


Table 106 ClassPortInfo (continued)

Component | Access | Length (bits) | Offset (bits) | Description
Reserved | RO | 27 | 32 | Reserved
RespTimeValue | RO | 5 | 59 | See 13.4.6.2 Timers and Timeouts on page 583.
RedirectGID | RO | 128 | 64 | The GID a requester shall use as the destination GID in the GRH of messages used to access redirected class services. If redirection is not being performed, this shall be set to zero.
RedirectTC | RO | 8 | 192 | The Traffic Class a requester shall use in the GRH of messages used to access redirected class services. For more on the definition and significance of traffic class see 8.2.2.3 Service Levels on page 189 and 8.3.2 Traffic Class (TClass) - 8 bits on page 191.
RedirectSL | RO | 4 | 200 | The SL a requester shall use to access the class services.
RedirectFL | RO | 20 | 204 | The Flow Label a requester shall use in the GRH of messages used to access redirected class services.
RedirectLID | RO | 16 | 224 | The DLID a requester shall use to access the class services.
RedirectP_Key | RO | 16 | 240 | The P_Key a requester shall use to access the class services.
Reserved | RO | 8 | 256 | Reserved
RedirectQP | RO | 24 | 264 | The QP a requester shall use to access the class services. Zero is illegal.
RedirectQ_Key | RO | 32 | 288 | The Q_Key associated with the RedirectQP. This Q_Key shall be set to the well known Q_Key.
TrapGID | RW | 128 | 320 | The GID to be used as the destination GID in the GRH of trap messages originated by this service. If all zeroes, no GRH is inserted in trap messages.
TrapTC | RW | 8 | 448 | The Traffic Class to be placed in the GRH of trap messages originated by this service. For more on the definition and significance of traffic class see 8.2.2.3 Service Levels on page 189 and 8.3.2 Traffic Class (TClass) - 8 bits on page 191.
TrapSL | RW | 4 | 456 | The SL that shall be used when sending trap messages originated by this service.
TrapFL | RW | 20 | 460 | The Flow Label to be placed in the GRH of trap messages originated by this service.
TrapLID | RW | 16 | 480 | The DLID to where trap messages shall be sent by this service. If all zeroes, traps shall not be sent from this port.
TrapP_Key | RW | 16 | 496 | The P_Key to be placed in the header for traps originated by this service.
TrapHL | RW | 8 | 512 | The Hop Limit to be placed in the GRH of trap messages originated by this service. This specifies the maximum number of routers through which the message containing the GRH specified here may pass. The default value is 255.
TrapQP | RW | 24 | 520 | The QP to which trap messages originated by this service shall be sent. Must not be zero.
TrapQ_Key | RW | 32 | 544 | The Q_Key associated with the TrapQP. This Q_Key shall have the high order bit set. See 10.2.4 Q_Keys on page 376 for a description of the significance of setting the high order bit.
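
The bit offsets and lengths listed in Table 106 are enough to pull individual components out of a raw ClassPortInfo attribute. The following is a minimal sketch for illustration only: the names `CLASS_PORT_INFO_FIELDS`, `extract_field` and `traps_enabled` are hypothetical, and the sketch assumes that bit offset 0 is the most significant bit of the first byte, which the table itself does not state.

```python
# Illustrative sketch, not part of the specification.
# (offset_bits, length_bits) pairs copied from the Table 106 rows above.
CLASS_PORT_INFO_FIELDS = {
    "RespTimeValue": (59, 5),
    "RedirectGID": (64, 128),
    "RedirectTC": (192, 8),
    "RedirectSL": (200, 4),
    "RedirectFL": (204, 20),
    "RedirectLID": (224, 16),
    "RedirectP_Key": (240, 16),
    "RedirectQP": (264, 24),
    "RedirectQ_Key": (288, 32),
    "TrapGID": (320, 128),
    "TrapTC": (448, 8),
    "TrapSL": (456, 4),
    "TrapFL": (460, 20),
    "TrapLID": (480, 16),
    "TrapP_Key": (496, 16),
    "TrapHL": (512, 8),
    "TrapQP": (520, 24),
    "TrapQ_Key": (544, 32),
}

# The last listed component ends at bit 544 + 32.
CLASS_PORT_INFO_BITS = 576


def extract_field(attr: bytes, name: str) -> int:
    """Return one ClassPortInfo component as an unsigned integer.

    Assumption: bit offset 0 is the most significant bit of attr[0].
    """
    if len(attr) * 8 < CLASS_PORT_INFO_BITS:
        raise ValueError("attribute shorter than 576 bits")
    offset, length = CLASS_PORT_INFO_FIELDS[name]
    as_int = int.from_bytes(attr[: CLASS_PORT_INFO_BITS // 8], "big")
    shift = CLASS_PORT_INFO_BITS - offset - length
    return (as_int >> shift) & ((1 << length) - 1)


def traps_enabled(attr: bytes) -> bool:
    """Per the TrapLID row: a value of all zeroes means traps are not sent."""
    return extract_field(attr, "TrapLID") != 0
```

Keeping the layout in one dictionary means a transcription error in an offset only has to be corrected in one place.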
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13.4.8.2 NOTICE 



The Notice attribute describes an exception or other channel adapter,
switch, or router event. It is used by both the trap mechanism described
in 13.4.9 Traps on page 594 and the Notice mechanism described in
13.4.10 Notice Queue on page 596.
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o13-1: Channel adapters, switches, and routers implementing Notice at- 
tributes shall conform to the definition specified in Table 107 Notice on 
page 592 . 

Table 107 Notice

Component | Access | Length (bits) | Offset (bits) | Description
IsGeneric | RO | 1 | 0 | If set to 1, notice is generic, else is vendor specific.
Type | RO | 7 | 1 | Enumeration indicating type of trap/notice: 0 - Fatal; 1 - Urgent; 2 - Security; 3 - Subnet Management; 4 - Informational; 5-0x7F - Reserved.
NodeType / VendorID | RO | 24 | 8 | If generic, indicates Node Type: 1 - Channel Adapter; 2 - Switch; 3 - Router; 0, 4-0xFFFFFF - Reserved. If not generic, indicates the 24 bit IEEE OUI assigned to the vendor.
TrapNumber / DeviceID | RO | 16 | 32 | If generic, indicates a class-defined trap number. Number 0xFFFF is reserved. If not generic, this is Device ID information as assigned by device manufacturer.
IssuerLID | RO | 16 | 48 | LID of issuer.
NoticeToggle | RO | 1 | 64 | For Notices, alternates between zero and one after each Notice is cleared. See Section 13.4.10 Notice Queue on page 596. For Traps, this shall be set to 0.
NoticeCount | RO | 15 | 65 | For Notices, indicates the number of notices queued on this channel adapter, switch, or router. See Section 13.4.10 Notice Queue on page 596. For Traps, this shall be set to all zeroes.
DataDetails | RO | 432 | 80 | If generic, data details is disambiguated by management class and TrapNumber. Otherwise disambiguation is vendor defined.



No mechanism is provided to clear the Notices queue without retrieving
Notices first.



13.4.8.3 INFORMINFO



The InformInfo attribute provides information for subscribing to a class
manager for event forwarding. See 13.4.11 Event Forwarding on page
597.
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o13-2: Channel adapters, switches, and routers implementing the Inform- 
Info attribute shall conform to the definition specified in Table 108 Inform- 
Info on page 593 . 



Table 108 InformInfo

Component | Type | Length (bits) | Offset (bits) | Description
GID | RW | 128 | 0 | Specifies specific GID to subscribe for. Set to all zeroes if not desired. See Section 13.4.11 Event Forwarding on page 597.
LIDRangeBegin | RW | 16 | 128 | Specifies the lowest LID in a range of LID addresses to subscribe for. Address 0xFFFF denotes all LID addresses.
LIDRangeEnd | RW | 16 | 144 | Specifies the highest LID in a range of LID addresses to subscribe for. Set to 0 if no range desired. Ignored if LIDRangeBegin is 0xFFFF.
P_Key | RW | 16 | 160 | The partition key to use.
IsGeneric | RW | 8 | 176 | If set to 1, forward generic traps. If set to 0, forward all vendor specific traps. Values above 1 are undefined.
Subscribe | RW | 8 | 184 | If set to 1, subscribe. If set to 0, unsubscribe. Values above 1 are undefined.
ClassRange | RW | 16 | 192 | Enumeration indicating class of trap/notice. Valid values: 0 - Fatal; 1 - Urgent; 2 - Security; 3 - Subnet Management; 4 - Informational; 0xFFFF - forward all.
DeviceID / TrapNumber | RW | 16 | 208 | If not generic, this is device ID information as assigned by device manufacturer. If generic, indicates trap number. Number 0xFFFF means forward any Device ID / TrapNumber.
RespTimeValue | RO | 32 | 224 | See Section 13.4.6.2.2 RespTimeValue on page 583.
Reserved | RO | 8 | 256 |
VendorID / NodeType | RW | 24 | 264 | If generic, indicates Node Type: 1 - Channel Adapter; 2 - Switch; 3 - Router; 0, 4-0xFFFFFF - Reserved. If not generic, indicates the 24 bit IEEE OUI assigned to the vendor.
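
The LID selection rules in Table 108 (0xFFFF in LIDRangeBegin meaning all LIDs, and 0 in LIDRangeEnd meaning no range) can be sketched as a small predicate. This is an illustrative sketch only: the class and function names are hypothetical, GID matching is left out, and treating "no range desired" as "match LIDRangeBegin alone" is an assumption about how the table is meant to be read.

```python
from dataclasses import dataclass

ALL_LIDS = 0xFFFF  # LIDRangeBegin value that denotes all LID addresses


@dataclass
class InformInfoLids:
    """The two LID components of Table 108 (field names are illustrative)."""
    lid_range_begin: int
    lid_range_end: int = 0


def lid_matches(info: InformInfoLids, source_lid: int) -> bool:
    """Return True if source_lid is selected by the subscription."""
    if info.lid_range_begin == ALL_LIDS:
        # LIDRangeEnd is ignored when LIDRangeBegin is 0xFFFF.
        return True
    if info.lid_range_end == 0:
        # Assumption: no range desired means a single LID is selected.
        return source_lid == info.lid_range_begin
    return info.lid_range_begin <= source_lid <= info.lid_range_end
```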



13.4.9 Traps 



Traps are asynchronous notifications for the purpose of alerting an entity 
within another channel adapter, switch, or router about exception condi- 
tions or other events of interest at a given channel adapter, switch, or 
router within the subnet. 

C13-31: Either Trap support, or Notice support, or both shall be provided
by all managers.

See 13.4.10 Notice Queue on page 596 for a description of Notice sup-
port.

For the subnet management class, traps originate at a CA, switch, or 
router and are always sent to the master subnet manager managing the 
originator. The managing SM is always on the same subnet since cross 
subnet communications is not allowed for subnet management class 
MADs. Subnet management traps are always allowed and there is no di- 
rect mechanism for preventing a CA, switch, or router from sending traps 
(note, there may be actions which as a side effect cause cessation of 
traps, i.e. downing a port, but these are not considered direct mecha- 
nisms). 

For each GS management class, whether traps are allowed to be sent for
that class is specified in ClassPortInfo for that class.

C13-32: If ClassPortInfo:TrapDLID for a particular port and class is zero,
traps shall not be generated from that port for that class.

If traps are allowed to be generated, the destination address and certain
other necessary message parameters are obtained from ClassPortInfo.
The destination for traps may be in the same subnet or in another subnet.
If traps are to be delivered to a destination not in the same subnet, the
ClassPortInfo:TrapGID is non zero and a GRH is required in the trap
message. If ClassPortInfo:TrapGID is zero, the trap destination is within the
same subnet and no GRH is included in the message.

If a GRH is required, the source GID in the GRH is the GIDIndex0 of the
originating port and other fields in the GRH are either fixed or are given
values drawn from counterpart fields in ClassPortInfo. 8.3 Global Route
Header on page 191 specifies the requirements applicable to GRHs.

Traps are always sent to the destination identified in ClassPortInfo and
only to that destination. It is the responsibility of any entity that sets
ClassPortInfo fields to assure that the values programmed are consistent. That
is, the combination of GID/DLID, P_Key, port, etc. must be consistent with
addressing of and access rules applicable to the specified port.

Traps may be issued by any channel adapter, switch, or router on the
subnet. Channel adapters, switches, and routers may repeat sending of a
Trap().

o13-3: Channel Adapters, switches and routers shall not send traps at a
rate greater than the trap rate limit specified. For a given port, the trap rate
limit shall be defined as the reciprocal of the time duration determined
from PortInfo:SubnetTimeout for that port. See Section 13.4.6.2.1
PortInfo:SubnetTimeout on page 583.

o13-4: Traps shall contain the Notice attribute to identify the trap. The
Notice attribute is described in 13.4.8.2 Notice on page 591.

o13-5: Trap originators shall use the same MADHeader:TransactionID
value for all instances of repeated traps.

Recipients of traps may send TrapRepress() MADs to trap originators.

o13-6: Upon receipt of a valid TrapRepress() MAD, the trap originator
shall cease sending the trap which matches the trap identified by the
TrapRepress() MAD. A trap being repeatedly sent matches a trap identified
in a TrapRepress() MAD when both MADHeader:TransactionID in the trap
MAD matches MADHeader:TransactionID in the TrapRepress() MAD and
the Notice attribute in the trap MAD matches the Notice attribute in the
TrapRepress().

o13-7: If a TrapRepress() is received and no matching trap is being sent,
the TrapRepress() shall be silently dropped and no other action taken.
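
The matching rule in o13-6 and the silent drop in o13-7 can be sketched as follows. This is a minimal illustration, not an implementation from the specification: `Mad`, `TrapSender` and their methods are hypothetical names, and a MAD is reduced to the two items that o13-6 compares (the transaction ID and the raw Notice attribute).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mad:
    """Only the two pieces of a MAD that o13-6 compares."""
    transaction_id: int
    notice: bytes  # raw Notice attribute


class TrapSender:
    """Tracks the traps that an originator is currently repeating."""

    def __init__(self) -> None:
        self._repeating = set()

    def start_repeating(self, trap: Mad) -> None:
        self._repeating.add(trap)

    def is_repeating(self, trap: Mad) -> bool:
        return trap in self._repeating

    def on_trap_repress(self, repress: Mad) -> bool:
        """Handle a TrapRepress(); return True if a trap was stopped.

        Equality of the frozen dataclass means both the transaction ID
        and the Notice attribute match. A non-matching TrapRepress() is
        dropped with no other action (o13-7).
        """
        if repress in self._repeating:
            self._repeating.remove(repress)
            return True
        return False
```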



Sending traps is optional, unless specified otherwise by a compliance
statement.

Management classes may choose to supplement trap handling through
the use of the event forwarding mechanism described in Section 13.4.11
Event Forwarding on page 597.



13.4.10 Notice Queue



Although Traps and the Notice Queue (NQ) mechanism (Section 13.4.10
Notice Queue on page 596) use the same Notice attribute to describe
events or conditions, the trap mechanism and the notice queue
mechanism are completely independent. There is no requirement that a Notice
queue entry be generated when a Trap is sent or vice versa.



The NQ is a repository for storing Notice attributes associated with the oc- 
currence of an event or the detection of a condition at a channel adapter, 
switch, or router. Notices in the NQ are queried or deleted using Get(No- 
tice) and Set(Notice) methods. 

o13-8: The Notice Queue shall operate as a first in first out queue. 

The information in the NQ is represented by the Notice attribute. See
13.4.8.2 Notice on page 591. For Get(Notice), the AM is an index into the
set of notices stored. A Get(Notice) with an AM of 0 selects the oldest
notice saved, that is, the Notice on the top of the queue, with successive
(incrementing) values of AM selecting successively newer notices.

o13-9: Notice:NoticeCount in a returned Notice attribute shall always
indicate the number of notices currently on the queue. Performing a
Get(Notice) does not remove a Notice from the NQ. Notice:NoticeCount
includes the notice returned in response to the Get().

o13-10: If AM is greater than the number of notices queued, or if the
queue is empty, the Notice attribute returned shall be an empty Notice and
Notice:NoticeCount shall contain the number of notices on the queue.
Otherwise the recipient of a Get(Notice) shall return a copy of the selected
notice in the Notice attribute in the response unless the queue is empty. If
the queue is empty an empty Notice attribute shall be returned.

To clear notices from the head of the Notice Queue, the requester sends
a Set(Notice) MAD containing a Notice attribute with:

• Notice:NoticeToggle set to match Notice:NoticeToggle in the
channel adapter's, switch's, or router's Notice attribute.

• Notice:NoticeCount set to the number of notices to delete.
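
The Get(Notice) and Set(Notice) behaviour specified in this section (o13-8 through o13-11) can be sketched as a small FIFO. The sketch is illustrative and rests on stated assumptions: the class and method names are hypothetical, an empty Notice is represented by `None`, any AM with no notice behind it is treated as out of range, and the toggle is flipped once per accepted Set(Notice), although Table 107 could also be read as flipping once per cleared notice.

```python
from collections import deque


class NoticeQueue:
    """FIFO of Notice payloads; index 0 is the oldest entry."""

    def __init__(self) -> None:
        self._queue = deque()
        self.toggle = 0

    def enqueue(self, notice: bytes) -> None:
        self._queue.append(notice)

    def get(self, am: int):
        """Get(Notice): return (notice_or_None, NoticeCount).

        Nothing is removed from the queue. None stands in for an
        empty Notice (empty queue or AM beyond the stored notices).
        """
        count = len(self._queue)
        if am >= count:
            return None, count
        return self._queue[am], count

    def set(self, toggle: int, count: int) -> bool:
        """Set(Notice): delete up to `count` of the oldest notices.

        Returns False when the toggle does not match, in which case
        the request is discarded and nothing changes.
        """
        if toggle != self.toggle:
            return False
        for _ in range(min(count, len(self._queue))):
            self._queue.popleft()
        # Assumption: flip once per accepted Set(Notice).
        self.toggle ^= 1
        return True
```

A requester would typically read the oldest notice with `get(0)`, then call `set()` with the toggle value it observed, so that a retransmitted Set(Notice) carrying a stale toggle does not delete further notices.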

o13-11: Upon receipt of a Set(Notice) MAD, if the recipient implements an
NQ, the recipient shall perform the following actions:

• If the NoticeToggle value in the Set(Notice) does not match the
NoticeToggle value in the Notice attribute on the channel adapter,
switch, or router, the Set(Notice) is silently discarded and no
other action is taken.

• The oldest notice and successively newer notices up to a total
number of notices indicated by Notice:NoticeCount in the Notice
attribute contained in the Set(Notice) shall be deleted. If
Notice:NoticeCount in the request Notice is greater than the number
of notices on the queue, the queue is emptied.

• The response to the next Get(Notice) request shall return a
Notice attribute that corresponds to the new top of the queue and
Notice:NoticeCount in the response shall reflect the updated
count of notices on the queue.

The types and number of notices captured by a channel adapter, switch,
or router is implementation-dependent. The actual size of the Notice
Queue is implementation specific and is not specified by the architecture.
Behavior of a full Notice Queue when the channel adapter, switch, or
router has another notice to queue is undefined.

Channel adapters, switches, and routers are not required to support the
NQ mechanism.



13.4.11 Event Forwarding



Nodes can request that specific traps sent to a class manager by a given
channel adapter, switch, or router be forwarded to them by subscribing for
traps from that channel adapter, switch, or router. This is done via the
event-forward subscription mechanism. To subscribe, an interested host sends
a Set(InformInfo) request to the class manager identifying the channel
adapter, switch, or router it wishes traps to be forwarded from by its GID,
LID, or identifying a set of channel adapters, switches, or routers whose
LIDs are in a specified LID range. The class manager responds with a
GetResp(InformInfo) message to confirm or deny such forwarding.

o13-12: A manager confirming a request for event subscription shall re-
spond with InformInfo:Subscribe set to 1.

o13-13: A manager denying a request for event subscription shall re-
spond with InformInfo:Subscribe set to 0.

Requestors wishing to subscribe to event forwarding may determine
which managers exist and their locations on the fabric by querying the SA.
See 15.4 Operations on page 706 for a discussion of subnet administra-
tion including restrictions on access to and use of the SA.

The exchange of MADs to effect subscription to event forwarding is
depicted in Figure 143 Subscribing and unsubscribing for forwarding on
page 598 below.

[Figure 143 Subscribing and unsubscribing for forwarding. Participants
shown: Interested Host, Class Manager, Endnode. Messages shown:
Set(InformInfo), GetResp(InformInfo).]



o13-14: Managers receiving a Set(InformInfo) shall verify that the re-
questor originating the Set(InformInfo) and a trap source identified by
InformInfo:GID, InformInfo:LIDRangeBegin, or by a LID included in the
range InformInfo:LIDRangeBegin-InformInfo:LIDRangeEnd are permitted
to access each other according to the current partitioning. The manager
shall perform verification by verifying that a valid path exists between the
requestor and the trap source.

This verification can be accomplished by requesting a path between those
two using a SA query operation. If such a path exists, the Set(InformInfo)
succeeds. If such a path does not exist, the LID is invalid as a trap source.
A LID specifies an invalid trap source if it is not assigned or if the associ-
ated port is not accessible to the requestor under current partitioning.

o13-15: If partition verification fails on Set(InformInfo), the manager re-
ceiving the request shall indicate in the response that the operation failed
with an invalid attribute status value (as defined in Section 13.4.7 Status
Field on page 587 and Table 104 MAD Common Status Field Bit Values
on page 587).

o13-16: If InformInfo:LIDRangeBegin-InformInfo:LIDRangeEnd specifies
a range, all LIDs in the range shall be checked for validity as trap sources.
If all are valid or invalid, operation shall be the same as if a single trap
source was specified. If InformInfo:LIDRangeBegin-InformInfo:LIDRangeEnd
specifies a range including one or more LIDs not valid as trap
sources, the manager shall either indicate the entire request is invalid,
using the invalid attribute status as above; or shall use a class-specific
means to indicate that some of the LIDs were associated with valid trap
sources and others were not, identifying the invalid ones using the invalid
attribute status.

o13-17: Managers which have confirmed a request for event subscription
shall forward corresponding events to the subscriber via the
Report(Notice) MAD.
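
The range check required by o13-16 sorts a request into three cases: every LID valid, every LID invalid, or a mixture. A minimal sketch of that classification is shown below. The function name and the `is_valid_source` callback are hypothetical; the callback stands in for whatever path query the manager actually performs against the SA.

```python
def check_lid_range(begin: int, end: int, is_valid_source) -> str:
    """Classify a LID range as 'all_valid', 'all_invalid', or 'mixed'.

    `is_valid_source(lid)` is a caller-supplied predicate (hypothetical)
    that reports whether a path exists between requestor and that LID.
    """
    if end < begin:
        raise ValueError("LIDRangeEnd is below LIDRangeBegin")
    results = [is_valid_source(lid) for lid in range(begin, end + 1)]
    if all(results):
        return "all_valid"
    if not any(results):
        return "all_invalid"
    # o13-16: reject the whole request, or report per-LID results
    # by class-specific means.
    return "mixed"
```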



The recipient of a Report(Notice) datagram should respond with a
ReportResp() MAD with the same transaction ID as issued by the class
manager and an empty Notice attribute.

Figure 144 Forwarding traps/notices from the class manager on page
599 depicts the MAD exchange associated with forwarding traps to a
subscriber. The report response provides a means for the class manager to
assure that subscribers receive reports, assuming communication with the
subscriber has not failed.

[Figure 144 Forwarding traps/notices from the class manager. Participants
shown: Interested Host, Class Manager, Endnode. Messages shown:
Trap(Notice), Report(Notice), ReportResp().]

This forwarding service is not available directly from the Subnet Manager
for Subnet Management traps. The SA must be used, see Chapter 15:
Subnet Administration.



13.5 MAD Processing



Non redirected MADs are distinguished from other packets by the desti-
nation queue pair specified in the packet. Two specific queue pair num-
bers are dedicated to supporting non redirected management operations.
Each of the dedicated queue pairs represents a unique interface to one or
more management services. These interfaces and the behaviors related
to associated services are specified in subsequent sections.

If redirection is in effect, redirected MADs may be directed to a queue pair 
different from either of the dedicated queue pairs. How a management 
service is associated with such a queue pair is implementation specific. 
Such MADs are standard packets and MADs arriving at a port are directed 
according to the standard procedures for directing packets to queue pairs. 
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13.5.1 MAD Interfaces 



Two required interfaces to management entities are specified based upon 
two well known queue pairs. These are known as the Subnet Manage- 
ment Interface and the General Service Interface. 

Figure 145 MAD Interface on page 600, below, depicts the general rela-
tionships among management entities and their interfaces to the wire.
Note, the figure itself is meant to be representative of a basic channel
adapter, switch, or router and is not meant to imply a specific implemen-
tation or to imply specific requirements or limitations.

[Figure 145 MAD Interface. Labels recoverable from the figure: Managed
node; Subnet Manager; Subnet Mgmt Agent; Vendor-specific Agent; Device
Mgmt Agent; Baseboard Mgmt; SNMP Tunneling; Subnet Administration;
Performance Mgmt; Comm. Mgmt Agent; SMI ("SMPs", VL 15); GSI
("GMPs", VL 0-14).]

Note: A channel adapter, switch, or router may or may not contain a subnet manager.
If a CA, switch or router does contain a subnet manager, the specific relationships
between the SM, the SMA, and the SMI are implementation specific.

C13-33: For each IBA port on an IBA Channel Adapter or router, or for port
0 on an IBA switch, Subnet Management MADs to be processed at that
port shall be destined to Queue Pair 0.

C13-34: For each IBA port on an IBA Channel Adapter or router, or for port
0 on an IBA switch, unless redirected (see Section 13.5.2 GSI Redirection
on page 603), SA or GS MADs to be processed at that port shall be
destined to Queue Pair 1.

Queue pairs 0 and 1 have unique semantics with respect to processing of
messages specifying one of them as the destination queue pair.
Implementations of QP 0 and 1 are not required to follow the semantics
associated with other queue pairs with respect to requirements such as posting
and consumption of WQEs, manipulation of an associated completion
queue, and so on. Messages arriving at QP 0 or QP 1 are processed in
accordance with the requirements set forth in this section and following
Sections: 13.5.3.1 MAD validation for subnet management MADs on page
605, 13.5.3.2.1 MAD validation at the GSI on page 606, and 13.5.3.2.2
MAD validation at the SA and GSAs on page 607.

13.5.1.1 Processing Subnet Management Packets (SMPs)

The Subnet Management Interface (SMI) is associated with QP 0. QP 0
is used exclusively for sending and receiving subnet management MADs.
Communications with the SMA in a channel adapter, switch, or router is
always through the SMI. If a channel adapter, switch, or router hosts a SM,
then communications between that SM and the SMA of each channel
adapter, switch, or router in the subnet is also through the SMI. Only
SMAs and SM communicate through this interface. No other entities may
do so.

The MADs of subnet management class are called SMPs.

C13-35: SMPs shall not travel beyond the boundaries of a subnet (i.e.
through a router).

MADs with a destination queue pair of 0 are validated according to the
rules specified in section 13.5.3.1 MAD validation for subnet management
MADs on page 605.

Validated MADs arriving for QP0 are handled by the SMI. It is not specified
how the SMI dispatches the SMPs between the SMA and a possible SM.

C13-36: On an HCA, SMPs not dispatched to the SMA shall be posted to
the QP0 queue pair exposed above the verb layer.

C13-37: For SMPs dispatched to the SMA, a vendor shall

• either never post such SMPs,

• or, always post such SMPs,

• or, offer a vendor specific option to select whether such SMPs are
never posted or are always posted,

where posting is with respect to the QP0 queue pair exposed above the
verb layer.

13.5.1.2 Processing General Services Management Packets (GMPs)

The General Services Interface (GSI) is associated with QP 1. QP 1 is
reserved exclusively for subnet administration and general services MADs.
Unless redirected, GSAs send and receive MADs by means of the GSI.
For a description of redirection see Section 13.5.2 on page 603.
Depending upon implementation, the GSI may also provide the interface
through which a class manager communicates with corresponding (class
specific) GSAs throughout the fabric.

The MADs defined for subnet administration and general services are
referred to as GMPs. The GSI acts as a demultiplexer for GMPs, distributing
messages destined for QP 1 to the appropriate service agent or class
manager, based upon MADHeader:MgmtClass in the MAD header. MADs
with a destination queue pair of 1 are validated according to the rules
specified in section 13.5.3.2.1 MAD validation at the GSI on page 606.

In those cases where the GSI provides an interface for both a class
service agent and the corresponding class manager, the determination of the
appropriate destination above the GSI demultiplexing is implementation
dependent.

On an HCA, the GSI is only aware of agents residing below the verb layer.

C13-38: GMPs dispatched to agents implemented below the verb layer
shall not be visible above the verb layer.

C13-39: GMPs that are not dispatched to agents implemented below the
verb layer shall be visible above the verb layer as posted to the QP1
exposed above the verbs.

The SL used by a GMP is neither specified nor constrained by virtue of
the fact it is a GMP. The choice of SL is outside of the scope of these
sections. Note that unlike SMPs which follow special and unique VL rules,
GMPs are standard unreliable datagrams subject to and only to the SL/VL
usage rules applicable to all unreliable datagrams.

If redirection has been configured for a management class, GMPs des-
tined to the QP specified in the redirection are treated exactly the same
as any other unreliable datagram. Since the destination QP is not QP 1,
they do not appear at the GSI but are delivered directly to the QP specified
in the redirection by the IB transport in the same manner as any other
unreliable datagram.

C13-40: GSAs that are accessed using redirection shall validate arriving
MADs according to the same rules as apply for queue pair 1.

GMPs may contain a GRH and may be forwarded across subnet bound-
aries. Whether or not a given class manager supports cross subnet com-
munications with corresponding class service agents is implementation
dependent.

Table 109 Management Interfaces Summary on page 603 summarizes
the properties associated with the above described management inter-
faces.

Table 109 Management Interfaces Summary

Property | Subnet Management Interface | General Services Interface
Queue Pair | QP0 | QP1
VL | VL 15 | not VL 15
Partitioning | not enforced | enforced
Q_Key | not enforced | enforced (Q_Key = 0x8001_0000)
Scope | Within subnet only | Routable across subnets
Class Key | Management Key (M_Key) | class dependent



13.5.2 GSI Redirection



By default, the interface from the wire to class service agents is the GSI.
A mechanism is provided by which the interface to a given class service
agent may be relocated to another queue pair. This mechanism is called
redirection and is specified in detail below. The SA as well as each GSA
may individually support this mechanism or not. The ClassPortInfo at-
tribute is used to indicate if redirection is supported, and, if so, contains
redirection information for MADs of the subject class.

C13-41: If, for a class, redirection is not being used, any GMP destined to
the associated class agent via QP 1 shall be processed by that agent.

C13-42: For any request sent to QP 1 with MADHeader:MgmtClass equal
to the class value of a class being redirected, a response shall be returned
containing ClassPortInfo for the class specified in the request.

C13-43: The Status field in a response including ClassPortInfo because
of redirection shall have the MADHeader:StatusField redirection-required
bit set indicating that a ClassPortInfo attribute was returned rather than
the expected attribute.

A response with the MADHeader:Status:RedirectionRequired bit set indi-
cates that the request was not performed and that the request must be is-
sued to the alternate interface specified in ClassPortInfo.

Redirection may be used at any time, so requesters should always be pre-
pared to be redirected.

It is permissible for different requesters for the same management class
on a channel adapter, switch, or router to be redirected to a different inter-
face. The redirection operation is depicted in Figure 146 GSI Redirection
on page 604.

Redirection information is also available by doing a normal Get() speci-
fying the class of interest in the MADHeader:MgmtClass field of the MAD
header and ClassPortInfo as the attribute.

The ClassPortInfo attribute contains all of the information necessary to ac-
cess the redirected service either from within the same subnet or from a
different subnet. ClassPortInfo may be programmed to include all of the
parameters a source needs to form a complete GRH.

o13-44: A GRH shall be included in redirected class messages only if the
RedirectGID component of ClassPortInfo is non zero.

It is the responsibility of any entity programming ClassPortInfo to assure
that the parameters provided for accessing redirected services are con-
sistent with address and access controls applicable to the redirected ser-
vice.

The ClassPortInfo attribute is described in 13.4.8.1 ClassPortInfo on page
589. The Status field is described in 13.4.7 Status Field on page 587.

[Figure 146 GSI Redirection. Participants shown: Manager, Node. The
manager issues Set() or Get(); the node returns GetResp() with the
appropriate Status value set and containing the ClassPortInfo attribute.
The responder redirects the requester to an alternate QP/DLID/GID/SL.]
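
From the requester's side, the redirection rules above amount to: send to QP 1, and if the response has the redirection-required bit set, rebuild the destination from the returned ClassPortInfo and reissue the request, adding a GRH only when RedirectGID is non zero (o13-44). The sketch below illustrates that loop under stated assumptions: all names are hypothetical, `send` is a caller-supplied transport callback, and the `max_redirects` bound is an added safeguard that the text does not mention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Redirect:
    """Redirect* components of ClassPortInfo (see Table 106)."""
    gid: int
    lid: int
    qp: int
    sl: int
    p_key: int
    q_key: int


@dataclass
class Destination:
    dlid: int
    qp: int
    sl: int
    p_key: int
    q_key: int
    dgid: Optional[int]  # None means the message carries no GRH


def destination_from_redirect(r: Redirect) -> Destination:
    if r.qp == 0:
        raise ValueError("RedirectQP of zero is illegal")
    # o13-44: include a GRH only if RedirectGID is non zero.
    dgid = r.gid if r.gid != 0 else None
    return Destination(r.lid, r.qp, r.sl, r.p_key, r.q_key, dgid)


def send_with_redirect(send, request, first_dest: Destination,
                       max_redirects: int = 4):
    """Issue `request`, following redirections.

    `send(request, dest)` is a hypothetical callback returning
    (redirection_required, payload); when redirection is required the
    payload is a Redirect taken from the returned ClassPortInfo.
    """
    dest = first_dest
    for _ in range(max_redirects + 1):
        redirected, payload = send(request, dest)
        if not redirected:
            return payload
        dest = destination_from_redirect(payload)
    raise RuntimeError("too many redirections")
```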
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1 

13.5.3 MAD Validation

Packets arriving at a port of a channel adapter, switch, or router are
validated according to the validation rules specified in 7.4 Data Packet Check
on page 141 and 9.6 Packet Transport Header Validation on page 228.
Only packets so validated are delivered to management entities. The
contents of the data payload are further validated by management entities to
validate that the data payload contains a valid MAD.

Valid MADs are delivered to appropriate management entities for
processing.

C13-45: Data payloads arriving at the SMI (QP 0) shall be validated as
indicated in the bulleted list below. Packets failing one or more of these
checks are discarded and no action is taken in response unless the method
is a Get() or a Set(). For Get() and Set() methods, the return of a
GetResp() when validation has failed is optional.

• The data payload length must be 256 bytes

• LRH:VL must be 15

• BTH:QP must be 0

• BTH:OpCode must be Send only UD

• MADHeader:BaseVersion must be 1

• MADHeader:MgmtClass must specify a class of Subn or Directed
Route Subn

• MADHeader:AttributeID must specify an attribute supported by
the class specified in MADHeader:MgmtClass.

o13-18: If a GetResp() is returned, any conditions detected that have
corresponding codes assigned in Table 104 MAD Common Status Field Bit
Values on page 587 shall be reflected by corresponding settings of bits in
MADHeader:StatusField in the response.
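
The SMI checks listed under C13-45 can be expressed as a table of named predicates, which also makes it easy to report which checks failed when a GetResp() is returned. The sketch below is illustrative only: the field names, the string labels used for the opcode and management classes, and the `supported_attributes` mapping are assumptions, and the sketch works on already-decoded header fields rather than raw packets.

```python
from dataclasses import dataclass


@dataclass
class ReceivedMad:
    """Already-decoded fields; the field names are illustrative."""
    payload_len: int
    vl: int
    dest_qp: int
    opcode: str
    base_version: int
    mgmt_class: str
    attribute_id: int


# Symbolic labels standing in for the real class encodings.
SMI_CLASSES = {"Subn", "DirectedRouteSubn"}


def smi_validation_failures(mad: ReceivedMad, supported_attributes) -> list:
    """Return the names of failed checks; an empty list means valid.

    `supported_attributes` maps a class label to the set of attribute
    IDs that class supports (hypothetical structure).
    """
    checks = [
        ("payload length must be 256 bytes", mad.payload_len == 256),
        ("LRH:VL must be 15", mad.vl == 15),
        ("BTH:QP must be 0", mad.dest_qp == 0),
        ("BTH:OpCode must be Send only UD", mad.opcode == "SEND_ONLY_UD"),
        ("BaseVersion must be 1", mad.base_version == 1),
        ("MgmtClass must be Subn or Directed Route Subn",
         mad.mgmt_class in SMI_CLASSES),
        ("AttributeID must be supported by the class",
         mad.attribute_id in supported_attributes.get(mad.mgmt_class, ())),
    ]
    return [name for name, ok in checks if not ok]
```

The GSI checks in C13-47 differ only in the VL, QP and class conditions, so the same structure could be reused with a second list of predicates.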

If a channel adapter, switch, or router supports a subnet manager, some
MADs may be destined for the SMA while others may be destined for the
SM. The discrimination between the SMA as a destination and the SM as
a destination is based on the class, the method, and the attribute. See
14.2 Subnet Management Class on page 611. Table 110 SM MAD
Sources and Destinations on page 606 indicates which SMPs originate at
an SM, which SMPs originate at an SMA, and which SMPs may be
destined to SMAs or to SMs.
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Table 110 SM MAD Sources and Destinations 



MAD Type | Source | Destination | Notes
Get(*) | SM | SMA | Applies for all attributes except SMInfo
Get(SMInfo) | SM | SM | Applies only for the SMInfo attribute
Set(*) | SM | SMA | Applies for all attributes except SMInfo
Set(SMInfo) | SM | SM | Applies only for the SMInfo attribute
GetResp(*) | SMA | SM | Applies for all attributes except SMInfo
GetResp(SMInfo) | SM | SM | Applies only for the SMInfo attribute
Trap() | SMA | SM | Applies to all subnet management traps.
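
Table 110 reduces to a short rule for deciding whether an arriving SMP belongs to the SMA or to the SM. The function below is a hypothetical illustration of that rule; how an SMI actually dispatches SMPs is left unspecified by the text, and the method and attribute names are passed as plain strings for simplicity.

```python
def smp_destination(method: str, attribute: str) -> str:
    """Return 'SM' or 'SMA' for a subnet management MAD, per Table 110."""
    if method in ("GetResp", "Trap"):
        # Responses and traps are always consumed by an SM.
        return "SM"
    if method in ("Get", "Set"):
        # Only SMInfo requests are exchanged between SMs.
        return "SM" if attribute == "SMInfo" else "SMA"
    raise ValueError(f"method not listed in Table 110: {method}")
```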



This specification does not require that subnet managers be implemented 
in any particular way. It does require that subnet managers be able to orig- 
inate and receive subnet management MADs. Implicitly, however an SM 
is realized, the mechanisms used must provide for the SM implementation 
to cause packets to be sent and received that have QP 0 as the source 
and destination QPs and which will be transmitted and received on VL 15. 

C13-46: The SMI shall handle all Directed Route SMPs as described in
14.2.2 SMPs and Directed Route Algorithm on page 614.

13.5.3.2 MAD Validation for Subnet Administration and General Services

13.5.3.2.1 MAD Validation at the GSI

C13-47: Data payloads arriving at the GSI (QP 1) shall be validated as
specified in the bulleted list below. Packets failing one or more of these
checks are discarded and no action is taken in response unless the
method is a Get() or a Set(). For Get() and Set() methods, the return of a
GetResp() when validation has failed is optional.

• The data payload length must be 256 bytes.

• LRH:VL must not be 15

• BTH:QP must be 1

• BTH:OpCode must be Send only UD

• MADHeader:BaseVersion must be 1

• MADHeader:MgmtClass must specify a class supported on the
channel adapter, switch, or router.

O13-19: If a GetResp() is returned, any conditions detected that have
corresponding codes assigned in Table 104 MAD Common Status Field Bit
Values on page 587 shall be reflected by corresponding settings of the bits
in MADHeader:StatusField of the response.
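
The GSI checks of C13-47 are independent predicates over a received packet. The following is a minimal Python sketch, assuming a hypothetical `ReceivedMad` record that holds already-parsed header values. The record, its field names and the function name are illustrative and are not defined by this specification.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ReceivedMad:
    """Hypothetical parsed view of a packet (not an IBA structure)."""
    payload_len: int        # data payload length in bytes
    vl: int                 # LRH:VL
    dest_qp: int            # BTH:QP
    is_send_only_ud: bool   # True when BTH:OpCode is "Send only UD"
    base_version: int       # MADHeader:BaseVersion
    mgmt_class: int         # MADHeader:MgmtClass


def gsi_validation_failures(mad: ReceivedMad, supported: Set[int]) -> List[str]:
    """Return the names of the C13-47 checks that the packet fails."""
    checks = [
        ("payload length", mad.payload_len == 256),
        ("LRH:VL", mad.vl != 15),
        ("BTH:QP", mad.dest_qp == 1),
        ("BTH:OpCode", mad.is_send_only_ud),
        ("MADHeader:BaseVersion", mad.base_version == 1),
        ("MADHeader:MgmtClass", mad.mgmt_class in supported),
    ]
    return [name for name, ok in checks if not ok]
```

An empty result means the packet passes. A non-empty result means the packet is discarded, with a GetResp() optionally returned when the method is a Get() or a Set().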
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1 



It is not specified how GMPs passing through the GSI are dispatched to
the appropriate class agents that are supported and which are not
redirected.

If a class is supported and if redirection has been configured for that class,
the response to a request arriving at the GSI containing
MADHeader:MgmtClass of a redirected class is to reply with the redirection
information for the class as specified in 13.5.2 GSI Redirection on page 603.

On CAs implementing the verbs layer specified in Chapter 11: Software
Transport Verbs on page 446, GSMs may be implemented either below
the verb layer or above the verb layer. For a GSM implemented above the
verb layer and communicating via a source QP 1, it is not specified how
disambiguation between GMPs destined to GSAs on that CA and GMPs
destined to that GSM is performed. The basis for such differentiation is
both class and context dependent as well as implementation dependent.

C13-48: If a CA does not support operation of a GSM via QP 1 from on
top of its verb layer, that is, if it does not implement disambiguation of
GMPs destined to a GSA below the verbs and GMPs destined to QP 1
associated with a GSM implemented above the verbs, it shall not permit QP
1 to be created above the verb layer.

See also 13.5.1.2 Processing General Services Management Packets
(GMPs) on page 602.

Regardless of the implementation, the behavior of GSAs and GSMs with
respect to the injection of messages on the wire, processing of messages
from the wire, and responding to messages received must conform to the
requirements of applicable sections in Chapter 16: General Services.

Implicitly, implementations of SAs, GSMs and GSAs must be able to send
GMPs destined to QP 1.



13.5.3.2.2 MAD validation at the SA and GSAs 

Packets arriving at an SA or GSA via QP 1 have already been validated
as properly formed MADs.

Packets arriving at an SA or GSA via any QP other than QP 1 have been
redirected. Such packets have not been validated as properly formed
MADs.

O13-20: Agents processing GMPs that have been redirected shall first
validate the GMPs as follows:

• The data payload length must be 256 bytes

• LRH:VL must not be 0

• The BTH:OpCode must be Send only UD

• MADHeader:BaseVersion must be 1

• MADHeader:MgmtClass must specify a class supported on the
channel adapter, switch, or router.

C13-49: All packets arriving for processing at an SA or a GSA shall be
further validated as follows:

• MADHeader:Method must specify a method valid for the specified
class

• MADHeader:AttributeID must specify an attribute supported by
the class specified in MADHeader:MgmtClass.

C13-50: GMPs failing one or more validity checks shall be discarded
unless the method is a Get() or a Set(). For Get() and Set() methods, the
return of a GetResp() when validation has failed is optional.

O13-21: If a GetResp() is returned, any conditions detected that have
corresponding codes assigned in Table 104 MAD Common Status Field Bit
Values on page 587 shall be reflected by corresponding settings of the bits
in the status field of the response.

GSAs are not required to check the validity of the attribute content.

Additional class specific checking requirements may be specified. Such
requirements, if any, are defined in the class specific sections of Chapter
15: Subnet Administration on page 670 and Chapter 16: General Services
on page 717.

13.5.4 Response Generation

Some methods require that the recipient return a response to the sender.
This requires that the recipient be able to build a properly formed
message which is consistent with the address and access rules associated
with the sender.

In general, the sender may be in the same subnet or in a different subnet.
Correspondingly, a request packet may or may not include a GRH.

C13-51: If the request packet does not contain a GRH, the response
packet shall not contain a GRH and the response packet is constructed as
follows:

• The SLID and DLID of the request packet are swapped and
become the DLID and SLID respectively in the response packet.

• The Destination QP and the Source QP of the request are
swapped and become the Source QP and the Destination QP
respectively in the response packet

• The SL specified in the request packet is used as the SL in the
response.

• Other fields required in headers in the response packet are
copied without change from the corresponding fields in the request
packet.

• A GRH is not inserted in the response packet.

C13-52: If the original request packet contained a GRH, then the
response packet must also contain a GRH. In this case the response packet
is constructed as follows:

• The SGID and DGID in the GRH of the incoming packet are
swapped and become the DGID and SGID respectively in the
GRH of the response packet.

• FlowLabel and TrafficClass are copied without change from the
GRH in the request packet to the GRH in the response packet.

• HopLimit in the GRH of the response packet is set to 0xFF.

• The SLID and DLID of the request packet are swapped and
become the DLID and SLID respectively in the response.

• The Destination QP and the Source QP of the request are
swapped and become the Source QP and the Destination QP
respectively in the response packet

• The SL specified in the request packet is used as the SL in the
response.

• Other fields required in headers in the response packet are
copied without change from the corresponding fields in the request
packet.

• The GRH as formed above must be inserted in the response
packet.

Note that GMP requests and responses, on the GSI or redirected, always
use the well-known Q_Key (0x8001_0000) in the DETH.
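
The response construction rules of C13-51 and C13-52 reduce to swapping the address pairs and, when a GRH is present, rebuilding it with a fixed HopLimit. The following is a minimal Python sketch, assuming hypothetical `Headers` and `Grh` records. Their names and fields are illustrative and do not describe the wire format.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Grh:
    sgid: str
    dgid: str
    flow_label: int
    traffic_class: int
    hop_limit: int


@dataclass(frozen=True)
class Headers:
    slid: int
    dlid: int
    sl: int
    src_qp: int
    dest_qp: int
    grh: Optional[Grh] = None   # None when the packet carries no GRH


def build_response_headers(req: Headers) -> Headers:
    """Derive response headers from request headers (C13-51 / C13-52)."""
    grh = None
    if req.grh is not None:
        # Swap SGID/DGID, keep FlowLabel and TrafficClass, HopLimit = 0xFF.
        grh = replace(req.grh, sgid=req.grh.dgid, dgid=req.grh.sgid,
                      hop_limit=0xFF)
    return Headers(
        slid=req.dlid, dlid=req.slid,              # LIDs swapped
        sl=req.sl,                                 # SL reused
        src_qp=req.dest_qp, dest_qp=req.src_qp,    # QPs swapped
        grh=grh,
    )
```

A request without a GRH therefore yields a response without one, and a request with a GRH always yields a response with one.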

Chapter 14: Subnet Management

Each subnet has at least one subnet manager (SM). Each SM resides on
a port of a CA, router, or switch and can be implemented either in
hardware or software. When there are multiple SMs on a subnet, one SM will
be the master SM. The remaining SMs must be standby SMs. There is
only one SM per port.

The master SM is a key element in initializing and configuring an IB
subnet. The master SM is elected as part of the initialization process for
the subnet and is responsible for:

• Discovering the physical topology of the subnet

• Assigning Local Identifiers (LIDs) to the endnodes, switches, and
routers

• Establishing possible paths among the endnodes

• Sweeping the subnet, discovering topology changes and managing
changes as nodes are added and deleted.

The communication between the master SM and the SMAs, and among
the SMs, is performed with subnet management packets (SMPs). SMPs
provide a fundamental mechanism for subnet management.

There are two types of SMPs: LID routed and directed route. LID routed
SMPs are forwarded through the subnet (by the switches) based on the
LID of the destination. Directed route SMPs are forwarded based on a
vector of port numbers that define a path through the subnet. Directed
route SMPs are used to implement several management functions, in
particular, before the LIDs are assigned to the nodes. SMPs are specified in
section 14.2 Subnet Management Class on page 611.

Every switch, CA, and router has a subnet management agent (SMA),
managed by the master SM. SMAs are specified in section 14.3 Subnet
Management Agent on page 651.

The details of operation for both master and standby SMs are described
in section 14.4 Subnet Manager on page 655.
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14.2 Subnet Management Class 



This section defines the Subnet Management class of MADs. These 
MADs are also referred to as Subnet Management Packets, or SMPs. The 
purpose of this class is to provide subnet configuration, monitoring and 
query of nodes within a subnet. SMPs are exchanged between a SM and 
SMAs on the subnet as described in Table 110 SM MAD Sources and 
Destinations on page 606 . 

There are two management classes dedicated to subnet management. 

C14-1: The subnet management classes shall be identified by the
MADHeader:MgmtClass value of 0x01 for the LID Routed class and 0x81 for
the Directed Route class as listed in Table 102 Management Class Values
on page 576.

This section will describe class-specific methods, attributes, standard 
header fields and protocols for the Subnet Management Classes. 



14.2.1 Datagram Formats and Use



C14-2: The datagrams in this class shall conform to the MAD format and
usage rules as specified in section 13.4.2 Management Datagram Format
on page 574.



14.2.1.1 SMP Data Format - LID Routed 



LID Routed SMPs are routed through the subnet using the normal switch 
forwarding tables set up during subnet initialization. 

C14-3: A LID routed SMP shall have a format shown in Figure 147 on
page 611 and Table 111 on page 612.
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Figure 147 SMP Format (LID Routed) 
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Table 111 SMP Fields (LID Routed) 



Object               Length      Description

Common MAD Header    24 bytes    Common MAD as described in 13.4.2 Management
                                 Datagram Format on page 574.

M_Key                8 bytes     A 64 bit key, which is employed for SM
                                 authentication. Usage is defined in section
                                 14.2.4 Management Key on page 623.

Reserved3            32 bytes    For aligning the SMP data field with the
                                 directed route SMP data field. Set to all
                                 zeroes.

SMP Data             64 bytes    64 byte field of SMP data used to contain the
                                 method's attribute.

Reserved4            128 bytes   Reserved. Shall be set to 0.



14.2.1.2 SMP Data Format - Directed Route 



Directed route SMPs are routed through the subnet from SMA to SMA
using a store-and-forward technique between neighboring nodes. They
are therefore not dependent on routing table entries. Directed route SMPs
are primarily used for discovering the physical connectivity of a subnet
before it has been initialized.

C14-4: A Directed Routed SMP shall have a format shown in Figure 147
SMP Format (LID Routed) on page 611 and Table 112 SMP Fields
(Directed Route) on page 613.



Figure 148 SMP Format (Directed Route) 
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Table 112 SMP Fields (Directed Route) 



Object               Length      Description

Common MAD Header1   4 bytes     Bytes 0-3 of the common MAD as described in
                                 13.4.2 Management Datagram Format on page 574.

D                    1 bit       Normally part of the class specific status
                                 field, this Direction bit is used by directed
                                 routing to determine direction of packet.
                                 If 0, the direction is outbound, from SM to
                                 endnode.
                                 If 1, the direction is inbound, from endnode
                                 to SM.

Status               15 bits     Code indicating status of method, as defined
                                 in 13.4.7 Status Field on page 587. There are
                                 no SMP status bits (bits 14-8 must be zero).

Hop Pointer          1 byte      Hop Pointer is used to indicate the current
                                 byte of the Initial/Return Path field.

Hop Count            1 byte      Hop Count is used to contain the number of
                                 valid bytes in the Initial/Return Path. It
                                 indicates how many direct route 'hops' to take.

Common MAD Header2   16 bytes    Bytes 8-23 of common MAD as described in
                                 13.4.2 Management Datagram Format on page 574.

M_Key                8 bytes     A 64-bit key, which is employed for SM
                                 authentication. Usage is defined in section
                                 14.2.4 Management Key on page 623.

DrSLID               2 bytes     Directed route source LID. Used in directed
                                 routing.

DrDLID               2 bytes     Directed route destination LID. Used in
                                 directed routing.

Reserved2            28 bytes    For the purpose of aligning the Data field on
                                 a 64 byte boundary. Set to all zeroes.

Data                 64 bytes    64-byte field of SMP data used to contain the
                                 method's attribute.

Initial Path         64 bytes    64-byte field containing the initial directed
                                 path. Each byte in this field represents a
                                 port.

Return Path          64 bytes    64-byte field containing the returning
                                 directed path. Each byte in this field
                                 represents a port.
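
The field lengths in Table 111 and Table 112 can be checked against the 256-byte MAD payload with Python's `struct` module. This is a sketch only. The format strings, the constant names and the helper `pack_d_status` are illustrative assumptions. In particular, placing the D bit as the most significant bit of the 16-bit status word is inferred from the field order in Table 112, and the common MAD header is treated as opaque bytes.

```python
import struct

# Table 111 (LID routed): header, M_Key, Reserved3, SMP Data, Reserved4.
LID_ROUTED_SMP = struct.Struct(">24sQ32s64s128s")

# Table 112 (directed route): Header1, D+Status, Hop Pointer, Hop Count,
# Header2, M_Key, DrSLID, DrDLID, Reserved2, Data, Initial Path, Return Path.
DIRECTED_ROUTE_SMP = struct.Struct(">4sHBB16sQHH28s64s64s64s")


def pack_d_status(d_bit: int, status: int) -> int:
    """Combine the 1-bit D field and the 15-bit Status into one 16-bit word."""
    return ((d_bit & 0x1) << 15) | (status & 0x7FFF)


def unpack_d_status(word: int):
    """Split the 16-bit word back into (d_bit, status)."""
    return (word >> 15) & 0x1, word & 0x7FFF
```

Both layouts come to 256 bytes, and in both the 64-byte data field starts at byte offset 64, which is the alignment that Reserved3 and Reserved2 exist to provide.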



14.2.2 SMPs and Directed Route Algorithm



Directed route SMPs provide a mechanism for forwarding management
packets throughout a configured, unconfigured, or partially configured
subnet. This mechanism can be used to discover nodes in the subnet,
perform diagnostics or verify link connectivity, bypassing the normal
switch LID forwarding mechanism.

There are two components that support this mechanism in subnets: the
Permissive destination address and directed routing. The Permissive
destination address is defined in 4.1 Terminology And Concepts on page 109.
When a node, including switches, receives a packet with this address it
forwards it to its Subnet Management Interface. Directed routing permits
the definition of an explicit route, based on intervening switch port
numbers, that a packet is to traverse throughout the subnet.

Directed routing, in general, progresses much more slowly than normal 
switching. This is because each switch along the route has to perform 
some processing on every directed routed packet. Moreover, the IBA per- 
mits nodes to reserve a minimal amount of buffering for processing of 
SMPs. As a result, SMPs may be discarded in the subnet if the injection 
rate exceeds the buffering and processing capacity of the subnet and end- 
nodes. Therefore, it is recommended that directed routing only be used 
where necessary. 

The directed routing algorithm provides a method to use normal LID
routing on either side of the directed route. No support is provided for
more than one directed route. It is not possible to specify two or more
directed routes with intervening LID routes. This is illustrated in Figure 149
Complete route using directed routing on page 615. The complete route
between two nodes is made up of three parts, each of them potentially
empty:

• From the source node to the source switch. This part uses LID
routing; the source node and the source switch are identified by their
LIDs. There may be other switches between them but this portion of
the subnet has already been configured to allow LID routing.

• From the source switch to the destination switch. This part uses
directed routing. The route is specified by stating the port number a
packet must use to leave a switch. This portion of the subnet need not
have been configured to allow LID routing.

• From the destination switch to the destination node. This part uses
LID routing; the destination node and the destination switch are
identified by their LIDs. There may be other switches between them but
this portion of the subnet has already been configured to allow LID
routing.

Figure 149 Complete route using directed routing

Since each part may be empty, there are eight combinations, although
only four are really useful:

• All three parts are empty. This is used to loopback to oneself before a
LID has been assigned. See Figure 150 Loopback using directed
routing on page 616.

• Both LID routed parts are empty. This is a pure directed route used in
a portion of a subnet not configured for LID routing. See Figure 151
Pure directed route on page 616.

Figure 151 Pure directed route

• One of the LID routed parts is empty. This is used when a portion of
the subnet, either at the source or at the destination, has been
configured for LID routing. See Figure 152 Directed route with LID routing
part at the source on page 617 and Figure 153 Directed route with
LID routing part at the destination on page 617.

Figure 152 Directed route with LID routing part at the source

Figure 153 Directed route with LID routing part at the destination

• No part is empty. This is the general case as illustrated in Figure 149
Complete route using directed routing on page 615 when portions of
the subnet have been initialized both at the source and destination
but not in between.

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067

The following section describes how a directed route packet is initialized, 
how a return packet is initialized, and the algorithm used along the route 
by switches to forward the packets. Note that in the switched portions of 
a route, the nodes are not directly involved - the packet is switched along 
the path just as any other packet is. 
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14.2.2.1 Outgoing Directed Route SMP Initialization 1 

C14-5: Only a SM shall originate a directed route SMP.

In the following sections, the source node where the SM originator resides
is called the requestor node, and the destination node is called the
responder node, even when describing the return process. When directed
route is used, it refers to the directed route part only, not the complete
route.

C14-6: The fields of the directed route SMP shall be initialized as follows:

3) Mgmt Class shall be set to the directed route Subnet Management
class as specified in Table 102 Management Class Values on page
576.

4) Method shall be set to SubnGet() or SubnSet() as specified in Table
103 Common Management Methods on page 578.

5) D bit shall be set to 0.

6) Hop Pointer shall be set to 0.

7) Hop Count shall be set to the number of hops, i.e. inter-switch links,
along the directed route part. Valid values are from 0 to 63. In Figure
149 Complete route using directed routing on page 615, this number
would be 3. In Figure 150 Loopback using directed routing on page
616, this number would be 0.

8) If the directed route part starts from the requestor node, i.e. there is
no LID routed part at the source as illustrated in Figure 150 Loopback
using directed routing on page 616, Figure 151 Pure directed route
on page 616 or Figure 153 Directed route with LID routing part at the
destination on page 617, then the DrSLID shall be set to the
Permissive LID. If the directed route does not start from the requestor
node, then DrSLID shall be set to the LID of the requestor node,
which must have been assigned.

9) If the directed route ends at the responder node, i.e. there is no LID
routed part at the destination as illustrated in Figure 150 Loopback
using directed routing on page 616, Figure 151 Pure directed route
on page 616 or Figure 152 Directed route with LID routing part at the
source on page 617, then the DrDLID shall be set to the Permissive
LID. If the directed route does not end at the responder node, then
the DrDLID shall be set to the LID of the responder node, which must
have been assigned.

10) Initial Path shall be set to an array of Hop Count port numbers
corresponding to the ports at the starting end of hops, specifically, the port
from which the SMP will start travelling on the inter-node link along
the directed route. The array shall be laid out in the Initial Path field
so that the byte at offset 0 in that field is reserved and the following
bytes are filled with the port numbers in order.

11) Return Path shall be set to an array of Hop Count zeroes. The array
shall be laid out in the Return Path field so that the byte at offset 0 in
that field is reserved and the following bytes are filled with zeroes.

12) All other fields shall be set the same way they are for a LID routed
SMP.

C14-7: The data packet headers for the unreliable datagram
encapsulating the directed route SMP shall be initialized as follows:

1) If the directed route part starts from the requestor node, the SLID
shall be set to the Permissive LID. If the directed route does not start
from the requestor node, the SLID shall be set to the LID of the
requestor node, which must have been assigned.

2) DLID shall be set to the Permissive LID if the directed route part
starts from the requestor node. If not, it shall be set to the LID of the
source switch in the directed route part. That LID must have been
assigned and routing must have been initialized between that switch
and the requestor node.

3) All other fields shall be set the same way they are for a LID routed
SMP.

The SM will then hand the packet to the SMI. If the DLID is the Permissive
LID, the SMI processes the packet as described in section 14.2.2.2
Outgoing Directed Route SMP handling by SMI on page 619. If the DLID is
not the Permissive LID, the SMI will output the packet as it does any LID
routed packet.

14.2.2.2 Outgoing Directed Route SMP handling by SMI

C14-8: Any SMP arriving at the SMI with a MADHeader:MgmtClass set to
0x81 (Directed Route class) shall be processed by the SMI.

C14-9: The SMI shall handle outgoing directed route SMPs (D bit is 0) as
defined by the following steps:

1) If Hop Count is non-zero and Hop Pointer is less than Hop Count (in
the range 0 to Hop Count-1):

The SMI shall alter the contents of the directed route SMP, in order, as
follows:

a) If Hop Pointer is more than 0, if this node is not a switch, the SMI
shall discard the SMP, otherwise the entry indexed by Hop
Pointer in the Return Path array of port numbers shall be set to the port
number where the SMP was received.

b) The Hop Pointer shall be incremented by 1.

c) All other fields shall remain unchanged.

The data packet headers for the unreliable datagram encapsulating
the directed route SMP shall be altered as follows:

e) The DLID shall be set to the Permissive LID.

The SMI shall output the packet on the port whose number is in the
entry indexed by Hop Pointer in the Initial Path. If that port number is
invalid, the SMI shall discard the SMP.

2) If Hop Pointer is equal to Hop Count, the SMI is the last one on the
directed route part. The SMI shall alter the contents of the directed
route SMP as follows:

a) The entry indexed by Hop Pointer in the Return Path array of port
numbers shall be set to the port number where the SMP was
received if Hop Pointer is non-zero.

b) The Hop Pointer shall be incremented by 1.

c) All other fields shall remain unchanged.

The SMI shall alter the data packet headers for the unreliable
datagram encapsulating the directed route SMP as follows:

d) If DrDLID is the Permissive LID, the SLID shall be set to the
Permissive LID. If DrDLID is not the Permissive LID, SLID shall be
set to the LID of this node. (That LID must have been assigned
and routing must have been initialized between this node and the
responder node.)

e) DLID shall be set to the DrDLID.

If the DLID is the Permissive LID, this node is the responder node and
the SMI shall hand the packet to the SMA or SM, which may check that
Hop Pointer is equal to Hop Count+1. If the DLID is not the Permissive
LID, the SMI will output the packet as it does any LID routed packet if
this node is a switch (if this node is not a switch, the SMP shall be
silently discarded).

3) If Hop Pointer is equal to Hop Count+1, this node is the responder
node and the SMI shall hand the packet to the SMA or SM, which
may check that Hop Pointer is equal to Hop Count+1.

4) If Hop Pointer is greater than Hop Count+1 (Hop Count+2 to 255),
the SMI shall silently discard the SMP.
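
The initialization rules of C14-6 and C14-7 and the per-node steps of C14-9 can be traced with a small model. The following is a minimal Python sketch restricted to a pure directed route (no LID routed parts). The `DrSmp` record, the function names and the returned action tuples are illustrative assumptions. The Permissive LID is taken to be 0xFFFF, the port validity check is omitted, and step 1 changes only the DLID because that is the only header item legible in the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PERMISSIVE_LID = 0xFFFF  # assumed value of the Permissive LID


@dataclass
class DrSmp:
    """Illustrative subset of a directed route SMP plus its LRH LIDs."""
    d_bit: int = 0
    hop_pointer: int = 0
    hop_count: int = 0
    dr_slid: int = PERMISSIVE_LID
    dr_dlid: int = PERMISSIVE_LID
    initial_path: List[int] = field(default_factory=lambda: [0] * 64)
    return_path: List[int] = field(default_factory=lambda: [0] * 64)
    slid: int = PERMISSIVE_LID   # LRH:SLID of the encapsulating packet
    dlid: int = PERMISSIVE_LID   # LRH:DLID of the encapsulating packet


def init_pure_directed_route(ports: List[int]) -> DrSmp:
    """C14-6 / C14-7 when the directed route spans the whole path."""
    if len(ports) > 63:
        raise ValueError("Hop Count must be in the range 0 to 63")
    smp = DrSmp(hop_count=len(ports))
    smp.initial_path[1:1 + len(ports)] = ports   # offset 0 is reserved
    return smp


def smi_outgoing(smp: DrSmp, is_switch: bool, in_port: int,
                 my_lid: Optional[int] = None) -> Tuple[str, Optional[int]]:
    """Apply C14-9 at one node and return (action, output port or None)."""
    hp, hc = smp.hop_pointer, smp.hop_count
    if hc != 0 and hp < hc:                        # step 1
        if hp > 0:
            if not is_switch:
                return ("discard", None)
            smp.return_path[hp] = in_port
        smp.hop_pointer += 1
        smp.dlid = PERMISSIVE_LID
        return ("forward", smp.initial_path[smp.hop_pointer])
    if hp == hc:                                   # step 2
        if hp != 0:
            smp.return_path[hp] = in_port
        smp.hop_pointer += 1
        smp.slid = PERMISSIVE_LID if smp.dr_dlid == PERMISSIVE_LID else my_lid
        smp.dlid = smp.dr_dlid
        if smp.dlid == PERMISSIVE_LID:
            return ("deliver", None)               # hand to SMA or SM
        return ("lid_route", None) if is_switch else ("discard", None)
    if hp == hc + 1:                               # step 3
        return ("deliver", None)
    return ("discard", None)                       # step 4
```

For a two-hop route built with `init_pure_directed_route([3, 5])`, the requestor node forwards on port 3, the intermediate switch records its receiving port in the Return Path and forwards on port 5, and the responder node delivers the SMP with Hop Pointer equal to Hop Count+1.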

The handling of returning directed route SMPs (D bit is 1) is described in
section 14.2.2.3 Returning Directed Route SMP Initialization on page 621.
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14.2.2.3 Returning Directed Route SMP Initialization 1 

The SMA or SM receiving a directed route SMP processes it (with regard
to handling of the method and attribute) as it does a LID routed SMP. The
receiving SMA or SM may determine that it should send a response.

C14-10: The fields of the directed route response SMP shall be initialized
as follows:

5) Method shall be set to SubnGetResp() as specified in Table 103
Common Management Methods on page 578.

6) D bit shall be set to 1.

7) Mgmt Class, Hop Pointer, Hop Count, DrSLID, DrDLID, Initial Path
and Return Path shall be copied as is from the request SMP.

8) All other fields shall be set the way they are set for a LID routed SMP.

C14-11: The data packet headers for the unreliable datagram
encapsulating the directed route response SMP shall also be initialized as follows:

1) If the SLID in the LRH for the unreliable datagram encapsulating the
directed route request SMP was the Permissive LID which indicates
that the returning directed route starts from the responder node (this
node), then the SLID shall be set to the Permissive LID. If the SLID of
the directed route request SMP was not the Permissive LID, then the
SLID shall be set to the LID of the responder node (this node). That
LID must have been assigned and routing must have been initialized
between the responder node (this node) and the node whose LID is
the SLID in the directed request SMP.

2) The DLID shall be set to the SLID of the directed route request SMP.

3) All other fields shall be set the way they are set for a LID routed SMP.

The SMA or SM will then hand the packet to the SMI. If the DLID is not the
Permissive LID, the SMI will output the packet as it does any LID routed
packet. If the DLID is the Permissive LID, the SMI processes the packet
as described in the following section.

14.2.2.4 Returning Directed Route SMP handling by SMI

C14-12: Any SMP arriving at the SMI with a MADHeader:MgmtClass set
to 0x81 (Directed Route class) shall be processed by the SMI.

C14-13: The SMI shall handle returning directed route SMPs (D bit is 1)
as defined by the following steps:

1) If Hop Count is non-zero and Hop Pointer is more than 1 (2 to Hop
Count+1):

The SMI shall alter the contents of the directed route SMP, in order, as
follows:

f) Hop Pointer shall be decremented by 1.

g) All other fields shall remain unchanged.

The SMI shall alter the data packet headers for the unreliable
datagram encapsulating the directed route SMP as follows:

h) SLID shall be set to the Permissive LID.

i) DLID shall be set to the Permissive LID.

The SMI shall output the packet on the port whose number is in the
entry indexed by Hop Pointer in the Return Path. If that port number is
invalid, the SMI shall discard the SMP.

2) If Hop Pointer is equal to 1, the SMI is the last one on the directed
route part. The SMI shall alter the fields of the directed route SMP, in
order, as follows:

j) Hop Pointer shall be decremented by 1.

k) All other fields shall remain unchanged.

The SMI shall alter data packet headers for the unreliable datagram
encapsulating the directed route SMP, in order, as follows:

l) SLID shall be set to the Permissive LID if DrSLID is the
Permissive LID. If not, it shall be set to the LID of this node. That LID
must have been assigned and routing must have been initialized
between this node and the requestor node.

m) DLID shall be set to the DrSLID.

If the DLID is the Permissive LID, then this node is the responder node
and the SMI must hand the packet to the SM which may check that
Hop Pointer is equal to 0.

If the DLID is not the Permissive LID and if this node is a switch, the
SMI will output the packet as it does any LID routed packet. If the DLID
is not the Permissive LID and this node is not a switch, then the SMP
shall be silently dropped.

3) If Hop Pointer is equal to 0, this node is the requestor node and the
SMI must hand the packet to the SM, which may check that Hop
Pointer is equal to 0.

4) If Hop Pointer is in the range (Hop Count+2) to 255, then the SMI
shall silently discard the SMP.
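
The response initialization of C14-10 and C14-11 and the per-node steps of C14-13 can be traced the same way as the outgoing case. The following is a minimal Python sketch. The `DrSmp` record, the function names and the action tuples are illustrative assumptions, the Permissive LID is taken to be 0xFFFF, and the port validity check is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PERMISSIVE_LID = 0xFFFF  # assumed value of the Permissive LID


@dataclass
class DrSmp:
    """Illustrative subset of a directed route SMP plus its LRH LIDs."""
    d_bit: int = 0
    hop_pointer: int = 0
    hop_count: int = 0
    dr_slid: int = PERMISSIVE_LID
    dr_dlid: int = PERMISSIVE_LID
    initial_path: List[int] = field(default_factory=lambda: [0] * 64)
    return_path: List[int] = field(default_factory=lambda: [0] * 64)
    slid: int = PERMISSIVE_LID   # LRH:SLID of the encapsulating packet
    dlid: int = PERMISSIVE_LID   # LRH:DLID of the encapsulating packet


def init_response(req: DrSmp, my_lid: Optional[int] = None) -> DrSmp:
    """C14-10 / C14-11: build the response from the request as received."""
    return DrSmp(
        d_bit=1,
        hop_pointer=req.hop_pointer, hop_count=req.hop_count,
        dr_slid=req.dr_slid, dr_dlid=req.dr_dlid,
        initial_path=list(req.initial_path),
        return_path=list(req.return_path),
        slid=PERMISSIVE_LID if req.slid == PERMISSIVE_LID else my_lid,
        dlid=req.slid,
    )


def smi_returning(smp: DrSmp, is_switch: bool,
                  my_lid: Optional[int] = None) -> Tuple[str, Optional[int]]:
    """Apply C14-13 at one node and return (action, output port or None)."""
    hp, hc = smp.hop_pointer, smp.hop_count
    if hc != 0 and 2 <= hp <= hc + 1:              # step 1
        smp.hop_pointer -= 1
        smp.slid = smp.dlid = PERMISSIVE_LID
        return ("forward", smp.return_path[smp.hop_pointer])
    if hp == 1:                                    # step 2
        smp.hop_pointer -= 1
        smp.slid = PERMISSIVE_LID if smp.dr_slid == PERMISSIVE_LID else my_lid
        smp.dlid = smp.dr_slid
        if smp.dlid == PERMISSIVE_LID:
            return ("deliver", None)               # hand to the SM
        return ("lid_route", None) if is_switch else ("discard", None)
    if hp == 0:                                    # step 3
        return ("deliver", None)
    return ("discard", None)                       # step 4
```

The response retraces the Return Path recorded on the way out: each SMI decrements Hop Pointer and forwards on the recorded port until Hop Pointer reaches 0 and the SMP is handed to the SM.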

The handling of outgoing directed route SMPs (D bit is 0) is described in
section 14.2.2.2 Outgoing Directed Route SMP handling by SMI on page
619.

14.2.3 Methods

The Subnet Management class uses a subset of the common methods
described in section 13.4.5 Management Class Methods on page 577.

Table 113 Subnet Management Methods

Method Type          Value   Description

SubnGet()            0x01    Request a get (read) of an attribute.
SubnSet()            0x02    Request a set (write) of an attribute.
SubnGetResp()        0x81    Response from a get or set request.
SubnTrap()           0x05    Notify an event occurred.
SubnTrapRepress()    0x07    Cease sending repeated Trap.

Table 110 SM MAD Sources and Destinations on page 606 indicates
which methods are applied to SMPs that originate at a SM, SMPs that
originate at a SMA, and SMPs that may be destined to SMAs or to SMs.

C14-14: Subnet Management entities, the SMA and SM, shall support the
methods listed in Table 113 Subnet Management Methods on page 623.

14.2.4 Management Key



SMPs are used to initialize and configure CAs, switches and routers, and 
are therefore considered privileged operations. As a result, there is a 
mechanism provided to authorize subnet management operations based 
on: 

• a Key stored in the MADHeader:M_Key of the LID routed and Direct- 
ed route subnet management class datagram as shown in Figure 147 
SMP Format (LID Routed) on page 611 and Figure 148 SMP Format 
(Directed Route) on page 613 . respectively. 

• a Key kept locally on each port in the Portlnfo:M_Key component of 
the Portlnfo attribute that is described in Table 127 Portlnfo on page 
634 . 

Authentication is performed by the management entity at the destination 
port and is achieved by comparing the key contained in the SMP with the 
key residing at the destination port. This key is known as the Management 
Key (M_Key). 
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C14-1 5: A M_Key contained in the MADHeader:M_Key of the SMP shall 1 
not be checked at the receiving port with the Portlnfo:M_Key set to zero. 2 
As a result, no authentication is performed. 3 



If the PortInfo:M_Key is nonzero, authentication at the receiving port and
access to the port attributes is determined by the contents of the
PortInfo:M_KeyProtectBits as described in section 14.2.4.1 Levels of
Protection on page 624. Finally, M_Keys can be lost, so Key recovery is
provided by the PortInfo:M_KeyLeasePeriod components and is described in
section 14.2.4.2 Lease Period on page 624.



14.2.4.1 Levels of Protection



C14-16: If the PortInfo:M_Key is non-zero, the management entity
residing at the port shall perform authentication determined by the contents
of the PortInfo:M_KeyProtectBits and the behaviors described in Table
114 Protection Levels on page 624.

Table 114 Protection Levels

PortInfo:M_KeyProtectBits  Description

0                          SubnGet() shall succeed for any key in the MADHeader:M_Key
                           and SubnGetResp(PortInfo) shall return the contents of the
                           PortInfo:M_Key component.
                           SubnSet() shall fail if MADHeader:M_Key does not match the
                           PortInfo:M_Key component in the port.

1                          SubnGet() shall succeed for any key in the MADHeader:M_Key
                           and SubnGetResp(PortInfo) shall return the contents of the
                           PortInfo:M_Key component set to zero if MADHeader:M_Key does
                           not match the PortInfo:M_Key component in the port.
                           SubnSet() shall fail if MADHeader:M_Key does not match the
                           PortInfo:M_Key component in the port.

2                          SubnGet() and SubnSet() shall fail if MADHeader:M_Key does
                           not match the PortInfo:M_Key component in the port.

3                          SubnGet() and SubnSet() shall fail if MADHeader:M_Key does
                           not match the PortInfo:M_Key component in the port.
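
The protection levels in Table 114, together with C14-15, can be sketched as a small decision function. This is only an illustration under stated assumptions: the function names, the string method names, and the use of plain integers for keys are hypothetical and are not part of the specification.

```python
# Hypothetical sketch of the M_Key check described by C14-15 and Table 114.
# Method names are plain strings; keys are Python ints standing in for the
# 64-bit MADHeader:M_Key and PortInfo:M_Key values.

def m_key_check_passes(method: str, mad_key: int, port_key: int,
                       protect_bits: int) -> bool:
    """Return True if the request is allowed to proceed."""
    if port_key == 0:
        return True          # C14-15: no check when PortInfo:M_Key is zero
    if mad_key == port_key:
        return True          # a matching key always passes
    # Key mismatch: only SubnGet() at protection levels 0 and 1 succeeds.
    return method == "SubnGet" and protect_bits in (0, 1)


def m_key_reported_on_get(mad_key: int, port_key: int,
                          protect_bits: int) -> int:
    """M_Key value returned in SubnGetResp(PortInfo) for an allowed SubnGet()."""
    if not m_key_check_passes("SubnGet", mad_key, port_key, protect_bits):
        raise ValueError("SubnGet() fails at this protection level")
    if protect_bits == 1 and port_key != 0 and mad_key != port_key:
        return 0             # level 1 hides the key from a mismatched reader
    return port_key
```

For example, with protection level 1 a reader holding the wrong key can still read PortInfo, but sees an M_Key of zero, while SubnSet() with the wrong key fails at every level.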



14.2.4.2 Lease Period

A Lease Period is specified by setting the contents of the
PortInfo:M_KeyLeasePeriod component. It is intended to allow an M_Key to
'expire' if the master SM inadvertently goes away without sharing the
M_Key with backup SMs and there is no other out-of-band recovery
mechanism available.

C14-17: The lease period timer shall start counting down toward zero on
a port when a SMP is received for which the M_Key check was performed
according to Table 114 and failed. If the lease timer countdown is already
underway, it shall not be interrupted by the arrival of that SMP.

C14-18: The PortInfo:M_KeyViolations component shall be incremented
on a port when a SMP is received for which the M_Key check was
performed according to Table 114 and failed. The incrementing shall stop
when the component reaches all 1s.

Furthermore, if the port is capable of sending traps, a M_Key violation trap
described in Table 117 Traps on page 628 may be sent to the master SM
indicating that the lease timer has started counting down.

C14-19: The lease period counter shall cease counting down and shall be
reset to the value contained in PortInfo:M_KeyLeasePeriod component
on a port when any SMP is received with MADHeader:M_Key that
matches the PortInfo:M_Key.

In response to that trap, the master SM may refresh the Lease Period. If
the master SM that originally set the M_Key has gone away, the Lease
Period may expire.

C14-20: The PortInfo:M_KeyProtectBits shall be set to zero when the
lease period counter transitions from non-zero to zero.

When the lease period expires, clearing the M_Key Protection bits will
allow any SM to read (and then set) the M_Key.

C14-21: When the PortInfo:M_KeyLeasePeriod is set to zero, the lease
period shall never expire.

Whether there is an out-of-band mechanism to reset data protected with
a lease period of zero is outside the scope of the specification.
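
The lease period rules C14-17 through C14-21 can be read as a small per-port state machine. The sketch below is an assumption-laden illustration, not the specified implementation: the class name, its method names, and the one-second `tick()` granularity are hypothetical, and the violation counter is assumed to be 16 bits wide as listed for PortInfo:M_KeyViolations.

```python
class PortMKeyLease:
    """Hypothetical model of the M_Key lease timer on one port."""

    def __init__(self, lease_period: int, protect_bits: int):
        self.lease_period = lease_period   # PortInfo:M_KeyLeasePeriod (seconds)
        self.protect_bits = protect_bits   # PortInfo:M_KeyProtectBits
        self.counter = lease_period
        self.counting = False
        self.violations = 0                # PortInfo:M_KeyViolations

    def on_failed_check(self) -> None:
        # C14-18: increment until the component reaches all 1s.
        if self.violations < 0xFFFF:
            self.violations += 1
        # C14-21: a lease period of zero never expires.
        # C14-17: a countdown already underway is not interrupted.
        if self.lease_period != 0 and not self.counting:
            self.counter = self.lease_period
            self.counting = True

    def on_matching_key(self) -> None:
        # C14-19: stop counting and reset to the configured lease period.
        self.counting = False
        self.counter = self.lease_period

    def tick(self) -> None:
        """Advance the timer by one second."""
        if not self.counting:
            return
        self.counter -= 1
        if self.counter == 0:
            # C14-20: protection bits are cleared on the non-zero to zero
            # transition, after which any SM can read and then set the M_Key.
            self.counting = False
            self.protect_bits = 0
```

A port created as `PortMKeyLease(3, 2)` that sees one failed check and then three ticks without a matching key ends with `protect_bits` equal to 0.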

14.2.4.3 Notes on Expected Usage

• The SM is responsible for keeping track of the M_Keys for the
nodes that it is managing, to make sure that it uses the correct
key for each node.

• If standby SMs exist in the subnet for redundancy, then the
M_Keys may be shared so that failover to another SM can be
accommodated easily.

• An SM may have exclusive access to a node (or set of nodes), by
using an M_Key which is only known by that SM and the
particular node(s).

• SubnSet() is always protected by this mechanism as it can affect
the state of the node. SubnGet() is protected only if
PortInfo:M_KeyProtectBits is appropriately set.

14.2.4.4 Update Procedure

Node protection/ownership is assigned in one "atomic" operation.

C14-22: The PortInfo:M_Key, the PortInfo:M_KeyProtectBits, and the
PortInfo:M_KeyLeasePeriod components in the PortInfo Attribute shall be set
in one SubnSet(PortInfo) method.

A returned SubnGetResp(PortInfo) with a status of zero indicates to the
SM that it has taken ownership of the node.

14.2.4.5 Initialization

C14-23: When initially powered-up or reset, the PortInfo:M_Key, the
PortInfo:M_KeyProtectBits, and the PortInfo:M_KeyLeasePeriod components
of a CA, management port of a switch, or router shall be set to zero if
NVRAM is not used or to a value stored in NVRAM.

If not stored in NVRAM, the PortInfo:M_Key, the
PortInfo:M_KeyProtectBits, and the PortInfo:M_KeyLeasePeriod components
may be set by the master SM during subnet initialization.

14.2.4.6 SMI

The SMI will not check the M_Key in the header of a SMP since that is the
responsibility of the management entities that reside behind the SMI.

14.2.5 Attributes

In the SMP, attributes can be up to 64 bytes long. Table 115 Subnet
Management Attributes (Summary) on page 626 summarizes the subnet
management attributes and Table 116 Subnet Management Attribute /
Method Map on page 627 indicates which methods apply to each
attribute.

C14-24: Subnet management entities shall support the attributes and
methods as listed in Table 115: Subnet Management Attributes (Summary)
and Table 116: Subnet Management Attribute / Method Map.



Table 115 Subnet Management Attributes (Summary)

Attribute Name            Attribute ID  Attribute Modifier          Description                           Required For
Notice                    0x0002        0x0000_0000                 Information regarding the associated  Optional
                                                                    Notice or Trap
NodeDescription           0x0010        0x0000_0000                 Node Description String               All Nodes
NodeInfo                  0x0011        0x0000_0000                 Generic Node Data                     All Nodes
SwitchInfo                0x0012        0x0000_0000                 Switch Information                    Switches
GUIDInfo                  0x0014        GUID Block                  Assigned GUIDs                        All CAs, Routers, and switch mgmt ports
PortInfo                  0x0015        Port Number                 Port Information                      All Ports on All Nodes
PartitionTable            0x0016        PortNumber/P_Key block      Partition Table                       All Ports on All Nodes
SLtoVLMappingTable        0x0017        Input/Output Port Number    Service Level to Virtual Lane         All Ports on All Nodes (optional [a])
                                                                    mapping Information
VLArbitration             0x0018        Output Port/Component       List of Weights                       All Ports on All Nodes (optional [b])
LinearForwardingTable     0x0019        LID Block                   Linear Forwarding Table Information   Switches (optional [c])
RandomForwardingTable     0x001A        LID Block                   Random Forwarding Database            Switches (optional [d])
                                                                    Information
MulticastForwardingTable  0x001B        LID Block                   Multicast Forwarding Database         Switches (optional)
                                                                    Information
SMInfo                    0x0020        0x0000_0000 - 0x0000_0005   Subnet Management Information         All nodes hosting an SM
VendorDiag                0x0030        0x0000_0000 - 0x0000_FFFF   Vendor Specific Diagnostic            All Ports on All Nodes
LedInfo                   0x0031        0x0000_0000                 Turn on/off LED                       All nodes
                          0xFF00-0xFFFF 0x0000_0000 - 0x0000_FFFF   Range reserved for Vendor Specific
                                                                    attributes.

a. Optional on ports that support only one data VL.

b. Prohibited on ports that support only one data VL.

c. LinearForwardingTable and RandomForwardingTable are mutually exclusive, but one is required.

d. LinearForwardingTable and RandomForwardingTable are mutually exclusive, but one is required.

Table 116 Subnet Management Attribute / Method Map

Attribute Name            Get  Set  Trap
Notice                    X    X    X
NodeDescription           X
NodeInfo                  X
SwitchInfo                X    X
GUIDInfo                  X    X
PortInfo                  X    X
PartitionTable            X    X
SLtoVLMappingTable        X    X
VLArbitration             X    X
LinearForwardingTable     X    X
RandomForwardingTable     X    X
MulticastForwardingTable  X    X
SMInfo                    X    X
VendorDiag                X
LedInfo                   X    X





14.2.5.1 Notice

This attribute is a common attribute described in section 13.4.8.2 Notice
on page 591. The following traps are defined for the Subnet Management
class.

Table 117 Traps

Number  Sending Node Type  DataDetails
128     switch             Link state of at least one port of switch at <LIDADDR> has changed.
129     any                Local Link Integrity threshold reached at <LIDADDR><PORTNO>
130     any                Excessive Buffer Overrun threshold reached at <LIDADDR><PORTNO>
131     switch             Flow Control Update watchdog timer expired at <LIDADDR><PORTNO>
256     any                Bad M_Key, <MKEY> from <LIDADDR> attempted <METHOD> with
                           <ATTRIBUTEID> and <ATTRIBUTEMODIFIER>.
257     any                Bad P_Key, <KEY> from <LIDADDR1>/<GIDADDR1>/<QP1> to
                           <LIDADDR2>/<GIDADDR2>/<QP2> on <SL>.
258     any                Bad Q_Key, <KEY> from <LIDADDR1>/<GIDADDR1>/<QP1> to
                           <LIDADDR2>/<GIDADDR2>/<QP2> on <SL>.

Traps use the following layout for the DataDetails component of the Notice
attribute. Fields shall be filled with the information corresponding to the
description of a given trap.

Table 118 Notice DataDetails For Trap 128

Field              Length(bits)  Description
LIDADDR            16            Local Identifier
Padding            416           Shall be ignored on read. Content is unspecified.

Table 119 Notice DataDetails For Traps 129, 130 and 131

Field              Length(bits)  Description
Reserved0          16            Shall be filled with zeroes
LIDADDR            16            Local Identifier
PORTNO             8             Local Identifier
Padding            392           Shall be ignored on read. Content is unspecified.

Table 120 Notice DataDetails For Trap 256

Field              Length(bits)  Description
Reserved0          16            Shall be filled with zeroes
LIDADDR            16            Local Identifier
Reserved1          16            Shall be filled with zeroes
METHOD             8             Method
Reserved2          8             Shall be filled with zeroes
ATTRIBUTEID        16            Attribute ID
ATTRIBUTEMODIFIER  32            Attribute Modifier
MKEY               64            M_Key
Padding            288           Shall be ignored on read. Content is unspecified.

Table 121 Notice DataDetails For Traps 257 and 258

Field              Length(bits)  Description
Reserved0          16            Shall be filled with zeroes
LIDADDR1           16            Local Identifier
LIDADDR2           16            Local Identifier
KEY                32            Q_Key or P_Key.
                                 If P_Key, the 16 most significant bits of the field
                                 shall be set to 0 and the 16 least significant bits
                                 of the field shall be set to the P_Key.
SL                 4             Service Level
Reserved2          4             Must be filled with zeroes
QP1                24            Queue Pair
Reserved3          8             Must be filled with zeroes
QP2                24            Queue Pair
GIDADDR1           128           Global Identifier.
                                 If no GRH is present in the offending packet, this
                                 field shall be filled with zeroes.
GIDADDR2           128           Global Identifier.
                                 If no GRH is present in the offending packet, this
                                 field shall be filled with zeroes.
Padding            32            Shall be ignored on read. Content is unspecified.
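
As an illustration of the DataDetails layouts above, the following sketch packs and unpacks the Table 119 layout used by traps 129, 130 and 131. It assumes that fields are laid out in table order from the start of DataDetails, most significant byte first; the function names are hypothetical. The field lengths in Table 119 add up to 432 bits, which is 54 bytes.

```python
import struct

DATA_DETAILS_BYTES = 54          # 16 + 16 + 8 + 392 bits = 432 bits
_TABLE_119_HEAD = ">HHB"         # Reserved0, LIDADDR, PORTNO (big-endian, assumed)


def pack_trap_129_to_131(lid: int, port: int) -> bytes:
    """Build DataDetails for traps 129, 130 and 131 (Table 119)."""
    head = struct.pack(_TABLE_119_HEAD, 0, lid, port)   # Reserved0 is zero
    # Padding content is unspecified; zeroes are used here for simplicity.
    return head + bytes(DATA_DETAILS_BYTES - len(head))


def unpack_trap_129_to_131(data: bytes) -> tuple:
    """Return (LIDADDR, PORTNO); padding is ignored on read."""
    if len(data) < struct.calcsize(_TABLE_119_HEAD):
        raise ValueError("DataDetails too short")
    _reserved0, lid, port = struct.unpack_from(_TABLE_119_HEAD, data)
    return lid, port
```

The same approach extends to Tables 118 and 121, whose field lengths also add up to 432 bits.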



14.2.5.2 NodeDescription

Table 122 NodeDescription

Component   Access  Length(bits)  Description
NodeString  RO      512           UNICODE string to describe node in text format.



14.2.5.3 NODEINFO

The NodeInfo Attribute provides fundamental management information
common to all CAs, routers, and switches. It shall be implemented by all
nodes. The value of some NodeInfo components varies by port within a
node.

InfiniBand™ Trade Association

Page 630

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067

InfiniBand™ Architecture Release 1.0
Volume 1 - General Specifications

Subnet Management

October 24, 2000
FINAL

Table 123 NodeInfo

Component         Access  Length (bits)  Offset (bits)
    Description

BaseVersion       RO      8              0
    Supported MAD Base Version. Indicates that this node supports up to
    and including this version. Set to 1.

ClassVersion [a]  RO      8              8
    Supported Subnet Management Class (SMP) Version. Indicates that this
    node supports up to and including this version. Set to 1.

NodeType          RO      8              16
    1: Channel Adapter
    2: Switch
    3: Router
    0, 4 - 255: Reserved

NumPorts [a]      RO      8              24
    Number of physical ports on this node.

Reserved          RO      64             32
    Reserved, shall be zero.

NodeGUID [a]      RO      64             96
    GUID of the HCA, TCA, switch, or router itself. All ports on the same
    node shall report the same NodeGUID. Provides a means to uniquely
    identify a node within a subnet and determine co-location of ports.

PortGUID [b]      RO      64             160
    GUID of this port itself. One port within a node can return the
    NodeGUID as its PortGUID if the port is an integral part of the node
    and is not field-replaceable.

PartitionCap [a]  RO      16             224
    Number of entries in the Partition Table for CA, router, and the
    switch management port. This is at a minimum set to 1 for all nodes
    including switches.

DeviceID [a]      RO      16             240
    Device ID information as assigned by device manufacturer.

Revision [a]      RO      32             256
    Device revision, assigned by manufacturer.

LocalPortNum      RO      8              288
    The link port number this SMP came on in.

VendorID [a]      RO      24             296
    Device vendor, per IEEE.

a. Value shall be the same for all ports on a node.

b. Value shall differ for each end port on a CA or router, but the same for all ports of a switch.
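
The offsets in Table 123 describe a 320-bit (40-byte) structure. A minimal decoding sketch follows, assuming the offsets are counted from the start of the attribute data and that multi-byte components are stored most significant byte first; the function name and the dictionary keys are illustrative only.

```python
import struct

# BaseVersion, ClassVersion, NodeType, NumPorts, 8 reserved bytes, NodeGUID,
# PortGUID, PartitionCap, DeviceID, Revision, LocalPortNum, VendorID (3 bytes)
NODE_INFO_FORMAT = ">BBBB8xQQHHIB3s"
NODE_INFO_BYTES = struct.calcsize(NODE_INFO_FORMAT)   # 40 bytes = 320 bits


def parse_node_info(data: bytes) -> dict:
    """Decode the NodeInfo components listed in Table 123."""
    if len(data) < NODE_INFO_BYTES:
        raise ValueError("NodeInfo needs at least 40 bytes")
    (base, cls, node_type, num_ports, node_guid, port_guid,
     partition_cap, device_id, revision, local_port, vendor) = \
        struct.unpack_from(NODE_INFO_FORMAT, data)
    return {
        "BaseVersion": base,
        "ClassVersion": cls,
        "NodeType": node_type,        # 1: Channel Adapter, 2: Switch, 3: Router
        "NumPorts": num_ports,
        "NodeGUID": node_guid,
        "PortGUID": port_guid,
        "PartitionCap": partition_cap,
        "DeviceID": device_id,
        "Revision": revision,
        "LocalPortNum": local_port,
        "VendorID": int.from_bytes(vendor, "big"),   # 24-bit value
    }
```

Because the attribute can be up to 64 bytes long, the sketch reads only the first 40 bytes and ignores anything after them.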
14.2.5.4 SWITCHINFO



The SwitchInfo Attribute provides management information specific to
switch nodes. It shall be implemented by all switches.



Table 124 SwitchInfo

Component                       Access  Length (bits)  Offset (bits)
    Description

LinearFDBCap                    RO      16             0
    Number of entries supported in the Linear Unicast Forwarding Table
    (starting at LID=0x0000 going up). LinearFDBCap = 0 indicates that
    there is no Linear Forwarding Database.

RandomFDBCap                    RO      16             16
    Number of entries supported in the Random Unicast Forwarding Table.
    RandomFDBCap = 0 indicates that there is no Random Forwarding
    Database.

MulticastFDBCap                 RO      16             32
    Number of entries supported in the Multicast Forwarding Table
    (starting at LID=0xC000 going up).

LinearFDBTop                    RW      16             48
    Indicates the top of the linear forwarding table. Packets received
    with unicast DLIDs greater than this value are discarded by the
    switch. This component applies only to switches that implement linear
    forwarding tables and is ignored by switches that implement random
    forwarding tables.

DefaultPort                     RW      8              64
    Forward to this port all the unicast packets from the other ports
    whose DLID does not exist in the random forwarding table, see
    Chapter 18: Switches.

DefaultMulticastPrimaryPort     RW      8              72
    Forward to this port all the multicast packets from the other ports
    whose DLID does not exist in the forwarding table, see section
    18.2.4.3.3 Required Multicast Relay on page 824.

DefaultMulticastNotPrimaryPort  RW      8              80
    Forward to this port all the multicast packets from the Default
    Primary port whose DLID does not exist in the forwarding table, see
    section 18.2.4.3.3 Required Multicast Relay on page 824.

LifeTimeValue                   RW      5              88
    Sets the time a packet can live in the switch, see section 18.2.5.4
    Transmitter Queueing on page 828.

PortStateChange                 RW      1              93
    It is set to one anytime the PortState component in the PortInfo of
    any ports transitions from Down to Initialize, Initialize to Down,
    Armed to Down, or Active to Down as a result of link state machine
    logic. Changes in PortState resulting from SubnSet do not change this
    bit. This bit is cleared by writing one; writing zero is ignored.

Reserved                        RO      2              94
    Reserved, shall be zero.

LIDsPerPort                     RO      16             96
    Specifies the number of LID/LMC combinations that may be assigned to
    a given external port for switches that support the Random Forwarding
    table.

PartitionEnforcementCap         RO      16             112
    Specifies the number of entries in the partition enforcement table
    per physical port. Zero indicates that partition enforcement is not
    supported by the switch.

InboundEnforcementCap           RO      1              128
    Indicates switch is capable of partition enforcement on received
    packets.

OutboundEnforcementCap          RO      1              129
    Indicates switch is capable of partition enforcement on transmitted
    packets.

FilterRawPacketInboundCap       RO      1              130
    Indicates switch is capable of raw packet enforcement on received
    packets.

FilterRawPacketOutboundCap      RO      1              131
    Indicates switch is capable of raw enforcement on transmitted
    packets.



14.2.5.5 GUIDINFO 



The GUIDInfo Attribute provides the means for setting the assigned local 
scope EUI-64 identifiers of channel adapters, routers, and switch man- 
agement ports. These local scope EUI-64 identifiers are concatenated 
with a subnet prefix to form GIDs that are described in section 4.1.1 GID 
Usage and Properties on page 110 . 

The Attribute Modifier is a pointer to a block of 8 GUIDs to which this at- 
tribute applies. Valid values are from 0 to 31 and are further limited by the 
size of the GUIDCap of the port. Any entries in the block beyond the end 
of the GUID table are ignored on write and read back as zero. The block 
element at offset zero is read-only and is a copy of the PortGUID compo- 
nent. 

The attribute selected corresponds to the port that received the SMP.

Table 125 GUIDInfo

Component  Access  Length(bits)  Description
GUIDBlock  RW      512           List of 8 GUID Block Elements.

Table 126 GUID Block Element

Component  Length(bits)  Description
GUID       64            GUID to be assigned to port.
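
The block addressing described for GUIDInfo (blocks of 8 GUIDs selected by an Attribute Modifier from 0 to 31, limited by GUIDCap) can be sketched as follows. The helper names are hypothetical, and the sketch assumes that GUIDCap is the number of entries in the port's GUID table and that the table is modeled as a Python list of integers.

```python
GUIDS_PER_BLOCK = 8
MAX_BLOCK = 31


def guid_table_location(guid_index: int, guid_cap: int) -> tuple:
    """Map a GUID table index to (Attribute Modifier, element within block)."""
    if not 0 <= guid_index < guid_cap:
        raise ValueError("index is outside the GUID table")
    block, element = divmod(guid_index, GUIDS_PER_BLOCK)
    if block > MAX_BLOCK:
        raise ValueError("Attribute Modifier must be in the range 0 to 31")
    return block, element


def read_guid_block(block: int, guid_table: list) -> list:
    """Return the 8 elements of a block; entries beyond the end of the
    GUID table read back as zero."""
    start = block * GUIDS_PER_BLOCK
    return [guid_table[i] if i < len(guid_table) else 0
            for i in range(start, start + GUIDS_PER_BLOCK)]
```

With a 12-entry table, index 10 falls in block 1 at element 2, and reading block 1 returns four table entries followed by four zeroes.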



14.2.5.6 PORTINFO

The PortInfo Attribute provides port-specific management information. It
shall be implemented for every port on a node. Note that the value of
some NodeInfo components varies by node type and by port within a
node.

The Attribute Modifier selects the port on which the operation specified
by the SMP is performed. For switches, channel adapters, and routers, the
range of values is 0 to N, where N is the number of ports and:

For channel adapters and routers the value of zero indicates that the
operation is performed on the port that received the SMP. Otherwise,
if the value is non-zero and does not match the port number where
the SMP is received, the PortInfo attribute is RO and the M_Key is
checked for both the port where the SMP was received and the port
selected by the attribute modifier.

For switches, a value of zero selects the management port. Otherwise,
if the value is non-zero, a physical port is selected.

Also, SubnSet(PortInfo) operations with component values set to zero
and reserved values are ignored (NOP).

Table 127 PortInfo

Component                       Access  Length (bits)  Offset (bits)
    Description

M_Key                           RW      64             0
    The 8-byte management key. See section 14.2.4 Management Key on
    page 623.

GidPrefix                       RW      64             64
    GID prefix for this port.

LID                             RW      16             128
    The base LID of this port.

MasterSMLID                     RW      16             144
    The base LID of the master SM that is managing this port.

CapabilityMask                  RO      32             160
    Supported capabilities of this node. A bit set to 1 for affirmation
    of supported capability.
    0: Reserved, shall be zero
    1: IsSM
    2: IsNoticeSupported
    3: IsTrapSupported
    4: IsResetSupported
    5: IsAutomaticMigrationSupported
    6: IsSLMappingSupported
    7: IsMKeyNVRAM (supports M_Key in NVRAM)
    8: IsPKeyNVRAM (supports P_Key in NVRAM)
    9: IsLEDInfoSupported
    10: IsSMdisabled
    11 - 15: Reserved, shall be zero
    16: IsConnectionManagementSupported
    17: IsSNMPTunnelingSupported
    18: Reserved, shall be zero
    19: IsDeviceManagementSupported
    20: IsVendorClassSupported
    21 - 31: Reserved, shall be zero

DiagCode                        RO      16             192
    Diagnostic code, as described in section 14.2.5.6.1 on page 640.

M_KeyLeasePeriod                RW      16             208
    Timer value used to indicate how long the M_Key Protection bits are
    to remain non zero after a SubnSet(PortInfo) fails a M_Key check. The
    value of the timer indicates the number of seconds for the lease
    period. With a 16 bit counter, the period can range from one second
    to approximately 18 hours. Default value shall be 15 seconds. 0 shall
    mean infinite. See section 14.2.4 Management Key on page 623.

LocalPortNum                    RO      8              224
    The link port number this SMP came on in.

LinkWidthEnabled                RW      8              232
    Enabled link width, indicated as follows:
    0: No State Change (NOP)
    1: 1x
    2: 4x
    3: 1x or 4x
    8: 12x
    9: 1x or 12x
    10: 4x or 12x
    11: 1x, 4x or 12x
    4 - 7, 12 - 254: Reserved (Ignored)
    255: Set to LinkWidthSupported value.
    When writing this field, only legal transitions are valid. See
    Volume 2.

LinkWidthSupported              RO      8              240
    Supported link width, indicated as follows:
    1: 1x
    3: 1x or 4x
    11: 1x, 4x or 12x
    0, 2, 4 - 10, 12 - 255: Reserved

LinkWidthActive                 RO      8              248
    Currently active link width, indicated as follows:
    1: 1x
    2: 4x
    8: 12x
    0, 3, 4 - 7, 9 - 255: Reserved

LinkSpeedSupported              RO      4              256
    Supported link speed, indicated as follows:
    1: 2.5Gbps
    0, 2 - 15: reserved

PortState                       RW      4              260
    Port State. Enumerated as:
    0: No State Change (NOP)
    1: Down (includes failed links)
    2: Initialize
    3: Armed
    4: Active
    5 - 15: Reserved - ignored
    When writing this field, only legal transitions are valid. See
    section 14.3.5 Port State Change on page 652.

PortPhysicalState               RW      4              264
    0: No state change
    1: Sleep
    2: Polling
    3: Disabled
    4: PortConfigurationTraining
    5: LinkUp
    6: LinkErrorRecovery
    7 - 15: Reserved - ignored
    When writing this field, only legal transitions are valid. See
    Volume 2.

LinkDownDefaultState            RW      4              268
    0: No state change
    1: Sleep
    2: Polling
    3 - 15: Reserved - ignored
    When writing this field, only legal transitions are valid. See
    Volume 2.

M_KeyProtectBits                RW      2              272
    See section 14.2.4 on page 623.

Reserved                        RO      3              274
    Reserved, shall be zero.

LMC                             RW      3              277
    LID mask for multipath support. Its usage is described in 7.11 Subnet
    Multipathing on page 185.

LinkSpeedActive                 RO      4              280
    Currently active link speed, indicated as follows:
    1: 2.5Gbps
    0, 2 - 15: reserved

LinkSpeedEnabled                RW      4              284
    Enabled link speed, indicated as follows:
    0: No State Change (NOP)
    1: 2.5Gbps
    2 - 14: Reserved (Ignored)
    15: Set to LinkSpeedSupported value
    When writing this field, only legal transitions are valid. See
    Volume 2.

NeighborMTU                     RW      4              288
    Active maximum MTU enabled on this port for transmit:
    1: 256
    2: 512
    3: 1024
    4: 2048
    5: 4096
    0, 6 - 15: reserved

MasterSMSL                      RW      4              292
    The administrative SL of the master SM that is managing this port.

VLCap                           RO      4              296
    Virtual Lanes supported on this port, indicated as follows:
    1: VL0
    2: VL0, VL1
    3: VL0 - VL3
    4: VL0 - VL7
    5: VL0 - VL14
    0, 6 - 15: reserved

Reserved                        RO      4              300
    Reserved, shall be zero.

VLHighLimit                     RW      8              304
    Limit of High Priority component of VL Arbitration Table, as defined
    in section 7.6.9 VL Arbitration and Prioritization on page 154.

VLArbitrationHighCap            RO      8              312
    VL/Weight pairs supported on this port in the VLArbitration table for
    high priority. Shall be 1 to 64 if more than one data VL is supported
    on this port, 0 otherwise. See section 7.6.9 VL Arbitration and
    Prioritization on page 154.

VLArbitrationLowCap             RO      8              320
    VL/Weight pairs supported on this port in the VLArbitration table for
    low priority. Shall be N to 64 if more than one data VL is supported
    on this port, 0 otherwise, N being the number of data VLs supported.
    See section 7.6.9 VL Arbitration and Prioritization on page 154.

Reserved                        RO      4              328
    Reserved, shall be zero.

MTUCap                          RO      4              332
    Maximum MTU supported by this port.
    1: 256
    2: 512
    3: 1024
    4: 2048
    5: 4096
    0, 6 - 15: reserved

VLStallCount                    RW      3              336
    Specifies the number of sequential packets dropped that causes the
    port to enter the VLStalled state. Refer to section 18.2.4.4
    Transmitter Queuing on page 826 for details.

HOQLife                         RW      5              339
    Sets the time a packet can live at the head of a VL Queue. Refer to
    section 18.2.5.4 Transmitter Queueing on page 828 for details.

OperationalVLs^ 


RW 


4 


344 


Virtual Lanes operational on this port, indicated as 

0: No change 

1: VLO 

2: VLO. VL1 

O. V L.\J V l_0 

4: VLO - VL7 
5: VL0-VL14 
6-15: reserved 


PartitionEnforce- 
mentlnbound^ 


RW 


1 


348 


Indicates support of optional partition enforcement. 
If set to one, enables partition enforcement on pack- 
ets received on this port. Zero disables partition 
enforcement on packets received from this port. 


PartitlonEnforce- 
mentOutbound^ 


RW 


1 


349 


Indicates support of optional partition enforcement. 
If set to one, enables partition enforcement on pack- 
ets transmitted from this port. Zero disables partition 
enforcement on packets transmitted from this port. 


FilterRawPacketln- 
bound^ 


RW 


1 


350 


Indicates support of optional raw packet enforce- 
ment. If set to 1 , raw packets arriving on this port 
are discarded. Zero disables raw enforcement on 
packets received from this port. 


FilterRawPacketOut- 
bound^ 


RW 


1 


351 


Indicates support of optional raw packet enforce- 
ment. If set to 1, raw packets departing on this port 
are discarded. Zero disables raw enforcement on 
packets received from this port. 


M_KeyViolations^ 


RW 


16 


352 


Counts the number of SMP packets that have been 
received at this port that have had invalid M_Keys, 
since power-on or reset. Increments till count 
reaches all Is and then must be set back to zero to 
re-enable incrementing. 


P_KeyViolations^ 


RW 


16 


368 


Counts the number of packets that have been 
received at this port that have had invalid P_Keys, 
since oower-on or reset. Refer to section 10.9.4 on 
oaae 430 for usaae descriotion. Increments till 
count reaches all Is and then must be set back to 
zero to re-enable incrementing. 


Q_KeyViolations^ 


RW 


16 


384 


Counts the number of packets that have been 
received at this port that have had invalid Q_Keys, 
since oower-on or reset. See section 10.2.4 on 
paoe 376 for usaae description. Increments till 
count reaches all Is and then must be set back to 
zero to re-enable incrementing. 


GUIDCap^ 


RO 


8 


400 


Number of GUID entries supported in the GUIDInfo 
attribute for this port. 



InfiniBand^"^ Trade Association 



Page 639 

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067 



InfiniBand™ Architecture Release 1.0 
Volume 1 - General Specifications 



Subnet Management 



October 24, 2000
FINAL 



Table 127 PortInfo

Component | Access | Length (bits) | Offset (bits) | Description
----------|--------|---------------|---------------|------------
Reserved | RO | 3 | 408 | Reserved, shall be zero.
SubnetTimeOut | RW | 5 | 411 | Specifies the maximum expected subnet propagation delay, which depends upon the configuration of the switches, to reach any other port in the subnet and shall also be used to determine the maximum rate which SubnTraps() can be sent from this port. The duration of time is calculated based on (4.096 µS * 2^SubnetTimeOut).
Reserved | RO | 3 | 416 | Reserved, shall be zero.
RespTimeValue | RO | 5 | 419 | Specifies the expected maximum time between the port reception of a SMP and the transmission of the associated response. The duration of time is calculated based on (4.096 µS * 2^RespTimeValue). The maximum value shall be 8.
LocalPhyErrors | RW | 4 | 424 | Threshold value. When the count of marginal link errors exceeds this threshold, the local link integrity error shall be detected as described in section 7.12.2 Error Recovery Procedures on page 187.
OverrunErrors | RW | 4 | 428 | Threshold value. When the count of buffer overruns over consecutive flow control update periods exceeds this threshold, the excessive buffer overrun error shall be detected as described in section 7.12.2 Error Recovery Procedures on page 187.

a. Applies to channel adapter and router ports and the management port on switches; unused otherwise.

b. Applies to channel adapter and router ports and to all switch ports except the management port; unused otherwise.

c. Applies to switch ports only; unused otherwise.

d. Applies to channel adapter and router ports.
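
The SubnetTimeOut and RespTimeValue components both encode a duration as 4.096 µS multiplied by two raised to the field value. The sketch below only illustrates that arithmetic; the function and constant names are hypothetical and are not defined by the specification.

```python
BASE_MICROSECONDS = 4.096  # base unit from the formula in Table 127


def field_to_microseconds(value: int) -> float:
    """Return 4.096 us * 2**value for a 5-bit timeout field."""
    if not 0 <= value <= 31:
        raise ValueError("timeout fields are 5 bits wide (0..31)")
    return BASE_MICROSECONDS * (1 << value)


# Example: a SubnetTimeOut of 18 corresponds to roughly 1.07 seconds.
subnet_timeout_us = field_to_microseconds(18)
```

Because the exponent grows the duration by a factor of two per step, small changes in the field value change the resulting timeout substantially.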

14.2.5.6.1 Interpretation of DiagCode 

The 16-bit PortInfo:DiagCode field provides both generic and vendor-specific
diagnostic functionality. For all ports, all bits set to zero means the
port status is good. Any non-zero value means there are possible error
conditions.

The PortInfo:DiagCode can provide three levels of diagnostic data:

• A high level, universal status provided by all ports. A PortInfo:DiagCode
of all zeroes indicates no exception conditions exist on the port.
Bits 3-0 of PortInfo:DiagCode, when non-zero, have the same meaning
for all ports and are defined in Table 128 Standard Encoding of
DiagCode Bits 3-0 on page 641 below.

• An optional, high level vendor-specific diagnostic code in bits 14-4 of
PortInfo:DiagCode. Interpretation of this field requires knowledge of
the port diagnostic codes.

• An optional, more detailed vendor-specific port attribute pointed to by
PortInfo:DiagCode. Availability of this information is indicated by bit
15 of PortInfo:DiagCode and the pertinent port attribute is then pointed
to by bits 14-4.

Figure 154 DiagCode Fields on page 641 summarizes the structure and
interpretation of PortInfo:DiagCode fields.
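
The three levels described above map onto three bit ranges of the 16-bit value. As a minimal sketch, assuming the DiagCode has already been read into an integer with bit 0 as the least significant bit, the fields can be separated as follows. The class and function names are illustrative only.

```python
from typing import NamedTuple


class DiagCode(NamedTuple):
    index_forward: bool  # bit 15
    vendor_bits: int     # bits 14-4
    standard_code: int   # bits 3-0


def decode_diag_code(raw: int) -> DiagCode:
    """Split a 16-bit DiagCode value into its three fields."""
    if not 0 <= raw <= 0xFFFF:
        raise ValueError("DiagCode is a 16-bit field")
    return DiagCode(
        index_forward=bool(raw & 0x8000),
        vendor_bits=(raw >> 4) & 0x7FF,
        standard_code=raw & 0xF,
    )


def port_is_good(raw: int) -> bool:
    """All bits zero means the port status is good."""
    return raw == 0
```

For instance, decode_diag_code(0x8054) yields bit 15 set, a value of 5 in bits 14-4 and a standard code of 4 in bits 3-0.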



[Figure 154: the 16-bit PortInfo DiagCode field drawn from Bit 15 to Bit 0
with all sixteen bits shown as 0. Annotations: bits 3-0 are reserved for
diagnostic codes common to all ports; bits 14-4 provide vendor-specific
diagnostic codes; bit 15 provides a means to chain vendor di...]

Figure 154 DiagCode Fields




The error information in bits 3-0 is interpreted in a standard fashion for all
ports as shown in Table 128 Standard Encoding of DiagCode Bits 3-0 on
page 641.

Table 128 Standard Encoding of DiagCode Bits 3-0

DiagCode Bits 3-0 | Description
------------------|------------
0x0 | Port Ready
0x1 | Performing Self Test
0x2 | Initializing
0x3 | Soft Error - Port Has A Non-Fatal Error And May Be Used
0x4 | Hard Error - Port May Not Be Used
0x5 - 0xF | Reserved

Bits 14-4 of PortInfo:DiagCode are used for vendor-specific modifiers to
the standard diagnostic information, as shown in Figure 155 DiagCode
Bits on page 642.

[Figure 155: a DiagCode whose bits 3-0 hold 0x04, which indicates Hard
Failure, while bits 4-14 are vendor-specific information regarding the
Hard Fail; together they form a generic fail indicator with vendor-specific
modifiers.]

Figure 155 DiagCode Bits

If bit 15, the IndexForward bit, is zero (0), bits 14-4 of the
PortInfo:DiagCode represent a vendor-specific diagnostic code.
Interpretation of the returned information is outside the scope of this
specification. Further diagnostic information might be provided in known
locations in one or more vendor-specific Attributes.

If the IndexForward bit is set, bits 14-4 of the PortInfo:DiagCode field are
used to index into the VendorDiag Attribute data for vendor-specific
diagnostic information. This allows dynamic chaining of diagnostic
information based on the type of exception. Bits 14-4 are interpreted as an
Attribute Modifier to be specified with a SubnGet(VendorDiag) to the port
being examined as defined in section 14.2.5.14 VendorDiag on page 648.

14.2.5.7 P_KeyTable

The P_KeyTable Attribute provides the means for assigning the P_Keys 
for ports. 
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The Attribute Modifier is divided in two halves: 

• The least significant 16 bits are a pointer to a block of 32 P_Key en- 
tries to which this Attribute applies. Valid values are 0 - 1023, and are 
further limited by the size of the P_Key table (specified by the Parti- 
tionCap for CAs, routers, and switch management ports or Partition- 
EnforcementCap for external ports on switches) for that node and 
any entries in the block beyond the end of the table are read-only and 
set to 0. 

• For switches, the upper 16 bits select the switch port, where valid val- 
ues are 1 - 255 to select physical ports and zero to select the switch 
management port. 

For CA and router, the upper 16 bits are ignored and the operation is 
performed on the port that received the SMP. 
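
As an illustration of the two halves of the Attribute Modifier, the sketch below composes the 32-bit value from a block index and a switch port, and finds the block that holds a given table entry. The function names are hypothetical; the value ranges are the ones stated above.

```python
P_KEYS_PER_BLOCK = 32  # each block covers 32 P_Key entries


def p_key_table_modifier(block: int, switch_port: int = 0) -> int:
    """Build the 32-bit Attribute Modifier for a P_KeyTable block.

    The low 16 bits carry the block index; the high 16 bits carry the
    switch port (zero selects the switch management port). CAs and
    routers ignore the high 16 bits.
    """
    if not 0 <= block <= 1023:
        raise ValueError("block index must be 0..1023")
    if not 0 <= switch_port <= 255:
        raise ValueError("switch port must be 0..255")
    return (switch_port << 16) | block


def block_of_entry(entry_index: int) -> int:
    """Return the block that contains the given P_Key table entry."""
    return entry_index // P_KEYS_PER_BLOCK
```

For example, entry 70 of the table on switch port 3 lives in block 2, so the modifier would be 0x00030002.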

Table 129 P_KeyTable

Component | Access | Length (bits) | Description
----------|--------|---------------|------------
P_KeyTable Block | RW | 512 | List of 32 P_Key Block Elements.

Table 130 P_Key Block Element

Component | Length (bits) | Description
----------|---------------|------------
MembershipType | 1 | If set to zero, the P_Key is limited type and the endnode may accept a packet with a matching full P_Key, but may not accept a packet with a matching limited P_Key. If set to one, the P_Key is full type and the endnode may receive packets with matching full or limited P_Key. A full description is in section 10.9.1.1 on page 428.
P_KeyBase | 15 | Base value of the P_Key that the endnode will use to check against incoming packets.
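
The membership rule in Table 130 can be sketched as below. This is not a normative check: it assumes the MembershipType bit is the most significant bit of the 16-bit element (it is listed first in the table) and it ignores every other validation step described in section 10.9. The function names are illustrative.

```python
from typing import Tuple


def split_p_key(p_key: int) -> Tuple[bool, int]:
    """Return (is_full_member, base) for a 16-bit P_Key element.

    Assumption: MembershipType is the top bit, P_KeyBase the low 15 bits.
    """
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("P_Key is a 16-bit value")
    return bool(p_key & 0x8000), p_key & 0x7FFF


def may_accept(local_p_key: int, incoming_p_key: int) -> bool:
    """Apply the limited/full membership rule from Table 130."""
    local_full, local_base = split_p_key(local_p_key)
    incoming_full, incoming_base = split_p_key(incoming_p_key)
    if local_base != incoming_base:
        return False
    # A limited member accepts only full-member packets;
    # a full member accepts both full and limited.
    return local_full or incoming_full
```

Under this rule two limited members of the same partition cannot exchange packets, while either can communicate with a full member.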



14.2.5.8 SLtoVLMappingTable 



The SLtoVLMappingTable Attribute provides the means for setting the SL 
to VL Mapping of a switch, CA, and router and its usage is described in 
7.6.6 VL Mapping Within a Subnet on page 152 . 

For a switch, this attribute is specific to an input port / output port combi- 
nation to which the specific SL to VL mapping applies: 
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• bits 15-8 of the Attribute Modifier specify the input port which can be 
1 to N, where N selects the physical port or 0 to indicate that the input 
port is the management port. 

• bits 7-0 of the Attribute Modifier specify the output port which can be 
1 to N, where N selects the physical port. 

• bits 31-16 must be zero. 

For CA and router, this attribute corresponds to the port receiving the SMP 
(the Attribute Modifier is ignored). 

Table 131 SLtoVLMappingTable

Component | Access | Length (bits) | Offset (bits) | Description
----------|--------|---------------|---------------|------------
SL0toVL | RW | 4 | 0 | The number of the VL on which packets using SL0 are output. 15 forces the packets to be dropped.
SL1toVL | RW | 4 | 4 | The VL associated with SL1
SL2toVL | RW | 4 | 8 | The VL associated with SL2
SL3toVL | RW | 4 | 12 | The VL associated with SL3
SL4toVL | RW | 4 | 16 | The VL associated with SL4
SL5toVL | RW | 4 | 20 | The VL associated with SL5
SL6toVL | RW | 4 | 24 | The VL associated with SL6
SL7toVL | RW | 4 | 28 | The VL associated with SL7
SL8toVL | RW | 4 | 32 | The VL associated with SL8
SL9toVL | RW | 4 | 36 | The VL associated with SL9
SL10toVL | RW | 4 | 40 | The VL associated with SL10
SL11toVL | RW | 4 | 44 | The VL associated with SL11
SL12toVL | RW | 4 | 48 | The VL associated with SL12
SL13toVL | RW | 4 | 52 | The VL associated with SL13
SL14toVL | RW | 4 | 56 | The VL associated with SL14
SL15toVL | RW | 4 | 60 | The VL associated with SL15
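
Table 131 packs sixteen 4-bit entries into 64 bits. The following sketch converts between that packed form and a list of sixteen VL numbers. It assumes that bit offset 0 denotes the most significant bits of the first byte, so that SL0 occupies the upper nibble of byte 0; that ordering is an assumption of this example, and the function names are hypothetical.

```python
from typing import List, Sequence


def unpack_sl_to_vl(table: bytes) -> List[int]:
    """Return the 16 VL numbers held in an 8-byte SLtoVLMappingTable."""
    if len(table) != 8:
        raise ValueError("SLtoVLMappingTable is 64 bits (8 bytes)")
    vls = []
    for byte in table:
        vls.append(byte >> 4)    # even-numbered SL: upper nibble
        vls.append(byte & 0x0F)  # odd-numbered SL: lower nibble
    return vls


def pack_sl_to_vl(vls: Sequence[int]) -> bytes:
    """Pack 16 VL numbers (each 0..15) into the 8-byte table."""
    if len(vls) != 16 or any(not 0 <= vl <= 15 for vl in vls):
        raise ValueError("expected 16 values in the range 0..15")
    return bytes((vls[i] << 4) | vls[i + 1] for i in range(0, 16, 2))
```

Packing followed by unpacking returns the original list, which makes the pair convenient for read-modify-write updates of a single SL entry.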
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14.2.5.9 VLArbitrationTable 



The VLArbitrationTable Attribute provides the means for setting the VL Ar- 
bitration for ports on CA, routers and switches and its usage is described 
in 7.6.9 VL Arbitration and Prioritization on page 154 (xref to section 
7.6.9). 

The Attribute Modifier is divided in two halves. The upper 16 bits specify 
the part of the tables that is accessed. 
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1 - lower 32 entries of the low priority VL Arbitration Table. 

2 - upper 32 entries of the low priority VL Arbitration Table. 

3 - lower 32 entries of the high priority VL Arbitration Table. 

4 - upper 32 entries of the high priority VL Arbitration Table. 

0, 5-65535 -reserved. 

For switches, the least significant 16 bits of the attribute modifier specify
the external port in bits 7-0 and bits 15-8 are reserved and must be set to 0.

For CA and router, this attribute corresponds to the port receiving the SMP
(the lower 16 bits of the Attribute Modifier are ignored).

Table 132 VLArbitrationTable

Component | Access | Length (bits) | Offset (bits) | Description
----------|--------|---------------|---------------|------------
VL/Weight pairs | RW | 512 | 0 | Lists of 32 VL/Weight Block elements, for which there may be up to 64 in total for a given priority. The interpretation is as follows: 1 - values 0-31 of low priority; 2 - values 32-63 of low priority; 3 - values 0-31 of high priority; 4 - values 32-63 of high priority.

Table 133 VL/Weight Block Element

Component | Length (bits) | Offset (bits) | Description
----------|---------------|---------------|------------
reserved | 4 | 0 |
VL | 4 | 4 | VL associated with element.
Weight | 8 | 8 | Weight associated with element, as defined in section 7.6.9 VL Arbitration and Prioritization on page 154. Zero indicates that this element is skipped.
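
Each VL/Weight Block Element is 16 bits wide. As a sketch, and again assuming that bit offset 0 is the most significant bit of the element, the VL and Weight can be extracted and the skipped (zero-weight) elements filtered out as shown. The function names are illustrative.

```python
from typing import Iterable, List, Tuple


def parse_vl_weight(element: int) -> Tuple[int, int]:
    """Return (vl, weight) from a 16-bit VL/Weight Block Element."""
    if not 0 <= element <= 0xFFFF:
        raise ValueError("element is 16 bits wide")
    vl = (element >> 8) & 0xF   # 4 bits at offset 4
    weight = element & 0xFF     # 8 bits at offset 8
    return vl, weight


def active_entries(elements: Iterable[int]) -> List[Tuple[int, int]]:
    """Drop elements whose weight is zero, since those are skipped."""
    parsed = (parse_vl_weight(element) for element in elements)
    return [(vl, weight) for vl, weight in parsed if weight != 0]
```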



14.2.5.10 LinearForwardingTable



The LinearForwardingTable Attribute provides the means for setting the 
linear forwarding table of a switch for the Unicast LIDs. 

The Attribute Modifier is a pointer to a block of 64 LIDs to which this at- 
tribute applies. Valid values are from 0 to 767, and are further limited by 
the size of the Linear Forwarding Table of the switch. Any entries in the 
block beyond the end of the table are ignored on write and read back as 
zero. If an invalid port number is written into an entry, packets sent to this
LID will be discarded and that entry shall be read back as 0xFF to indicate
that an invalid port number was used.

Table 134 LinearForwardingTable

Component | Access | Length (bits) | Description
----------|--------|---------------|------------
LinearForwardingTable Block | RW | 512 | List of 64 Port Block Elements.

Table 135 Port Block Element

Component | Length (bits) | Description
----------|---------------|------------
Port | 8 | Port to which packets with the LID corresponding to this entry are to be forwarded.
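
Because each block holds 64 one-byte entries indexed directly by LID, the block and the position within the block follow from integer division. The sketch below shows that calculation and a lookup that treats 0xFF as "discard". It assumes the whole table has been read into one byte string, one byte per LID; the names are hypothetical.

```python
from typing import Optional, Tuple

LIDS_PER_BLOCK = 64
MAX_BLOCK = 767      # largest valid Attribute Modifier value
INVALID_PORT = 0xFF  # read back when an invalid port number was written


def locate_lid(lid: int) -> Tuple[int, int]:
    """Return (block, offset) of the entry for a unicast LID."""
    block, offset = divmod(lid, LIDS_PER_BLOCK)
    if lid < 0 or block > MAX_BLOCK:
        raise ValueError("LID is outside the linear forwarding table")
    return block, offset


def lookup_port(table: bytes, lid: int) -> Optional[int]:
    """Return the output port for a LID, or None if packets are discarded."""
    block, offset = locate_lid(lid)
    port = table[block * LIDS_PER_BLOCK + offset]
    return None if port == INVALID_PORT else port
```

With 768 blocks of 64 entries the table can describe at most 49152 LIDs.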



14.2.5.11 RandomForwardingTable 



The RandomForwardingTable Attribute provides the means for setting the 
random forwarding table of a switch for the Unicast LIDs. 

The Attribute Modifier is a pointer to a block of 16 LID/port pairs to which 
this Attribute applies. Valid values are from 0 to 3071 , and are further lim- 
ited by the size of the Random Forwarding Table of the switch and any en- 
tries in the block beyond the end of the table are read-only and set to 0. 

Table 136 RandomForwardingTable

Component | Access | Length (bits) | Description
----------|--------|---------------|------------
RandomForwardingTable Block | RW | 512 | List of 16 LID/Port Block Elements.

Table 137 LID/Port Block Element

Component | Length (bits) | Offset (bits) | Description
----------|---------------|---------------|------------
LID | 16 | 0 | Base LID.
Valid | 1 | 16 | This LID/Port pair is valid. Note that setting this parameter to 0 allows the removal of entries.
LMC | 3 | 17 | The LMC of this LID.
Reserved | 4 | 20 |
Port | 8 | 24 | Port to which packets with this LID/LMC corresponding to this entry are to be forwarded.
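
A 32-bit LID/Port Block Element can be unpacked as sketched below. Two assumptions are made that Table 137 does not itself state: bit offset 0 is taken to be the most significant bit of the element, and an entry with a given LMC is taken to match every destination LID that agrees with the base LID once the low LMC bits are ignored. All names are illustrative.

```python
from typing import NamedTuple


class LidPortElement(NamedTuple):
    lid: int     # 16 bits at offset 0
    valid: bool  # 1 bit at offset 16
    lmc: int     # 3 bits at offset 17
    port: int    # 8 bits at offset 24


def parse_lid_port(element: int) -> LidPortElement:
    """Unpack a 32-bit LID/Port Block Element."""
    if not 0 <= element <= 0xFFFFFFFF:
        raise ValueError("element is 32 bits wide")
    return LidPortElement(
        lid=element >> 16,
        valid=bool((element >> 15) & 1),
        lmc=(element >> 12) & 0x7,
        port=element & 0xFF,
    )


def matches(entry: LidPortElement, dlid: int) -> bool:
    """Assumed rule: compare LIDs with the low LMC bits masked off."""
    return entry.valid and (dlid >> entry.lmc) == (entry.lid >> entry.lmc)
```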
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14.2.5.12 MulticastForwardingTable 



This MulticastForwardingTable Attribute provides the means for setting 
the multicast forwarding table of a switch. 

The ten low-order bits of the Attribute Modifier are a pointer to a block of
32 PortMask entries to which this attribute applies. Valid values are from
0 to 511, and are further limited by the size of the Multicast Forwarding
Table of the switch. Any entries in the block beyond the end of the table
are read-only and set to 0.

The four high-order bits of the Attribute Modifier indicate the position (p)
of the 16-bit PortMask entry of this Attribute. Each PortMask entry specifies
only 16 bits of the 256 possible bits of a port mask of a maximum size
switch. The remaining 18 bits of the Attribute Modifier shall be set to zero.

Table 138 MulticastForwardingTable

Component | Access | Length (bits) | Description
----------|--------|---------------|------------
MulticastForwardingTable Block | RW | 512 | List of 32 PortMask Block Elements.

Table 139 PortMask Block Element

Component | Length (bits) | Description
----------|---------------|------------
PortMask | 16 | 16 bits starting at position 16*p of the port mask associated with the particular LID. An incoming packet with this LID is forwarded to all ports for which the bit in the port mask is set to 1. Note that an invalid LID is indicated with an all zero PortMask.
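
Since one PortMask entry carries only 16 of the up to 256 mask bits, a full port mask is assembled from several reads at different positions p. The sketch below builds the Attribute Modifier and combines the pieces. It assumes that bit i of the piece at position p corresponds to port 16*p + i; the function names and the dictionary-based interface are hypothetical.

```python
from typing import Dict, List


def mft_modifier(block: int, position: int) -> int:
    """Attribute Modifier: position p in the top 4 bits, block in the low bits."""
    if not 0 <= block <= 511:
        raise ValueError("block must be 0..511")
    if not 0 <= position <= 15:
        raise ValueError("position must be 0..15")
    return (position << 28) | block


def assemble_port_mask(pieces: Dict[int, int]) -> int:
    """Combine 16-bit PortMask pieces, keyed by position p, into one mask."""
    mask = 0
    for position, bits in pieces.items():
        mask |= (bits & 0xFFFF) << (16 * position)
    return mask


def output_ports(mask: int) -> List[int]:
    """List the ports whose bit is set to 1 in the assembled mask."""
    return [port for port in range(256) if (mask >> port) & 1]
```

An all-zero assembled mask yields an empty port list, which corresponds to the invalid-LID case noted in Table 139.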



14.2.5.13 SMInfo



The SMInfo attribute is used by Subnet Managers to exchange informa- 
tion during subnet discovery and polling as described in section 14.4 
Subnet Manager on page 655 . This attribute shall be available on a port 
where a Subnet Manager resides. 

Table 140 SMInfo

Component | Access | Length (bits) | Offset (bits) | Description
----------|--------|---------------|---------------|------------
GUID | RO | 64 | 0 | PortGUID of the port where the SM resides.
SM_Key | RO | 64 | 64 | Key of this SM. This is shown as 0 unless the requesting SM is proven to be the master, or the requester is otherwise authenticated.
ActCount | RO | 32 | 96 | Counter that increments each time the SM issues an SMP or performs other management activities. Used as a "heartbeat" indicator by standby SMs.
Priority | RO | 4 | 100 | Administratively assigned priority for this SM. Can be reset by master SM. 0 is lowest priority.
SMState | RO | 4 | 104 | Enumerated value indicating this SM's state. Enumerated as follows: 0 - not active; 1 - discovering; 2 - standby; 3 - master; 4-15 - Reserved.

14.2.5.14 VendorDiag



The VendorDiag Attribute provides a way to obtain vendor specific diag- 
nostic information. The interpretation of the VendorDiag:DiagData is spe- 
cific to the port in question. It is accessible from ports on CAs, routers, and 
the management port on a switch. 

Table 141 VendorDiag

Component | Access | Length (bits) | Offset (bits) | Description
----------|--------|---------------|---------------|------------
NextIndex | RO | 16 | 0 | Next Attribute Modifier to get to diagnostic info. Set to zero if this is last or only diagnostic data.
DiagData | RO | 496 | 16 | Vendor specific diagnostic information. Format is undefined.
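
The NextIndex component turns VendorDiag into a linked list of diagnostic blocks that ends when NextIndex is zero. A minimal sketch of walking such a chain follows. The callable get_vendor_diag is a hypothetical stand-in for issuing SubnGet(VendorDiag) with a given Attribute Modifier and returning (NextIndex, DiagData); the loop guard is an addition of this example and is not required by the specification.

```python
from typing import Callable, List, Tuple


def collect_vendor_diag(
    get_vendor_diag: Callable[[int], Tuple[int, bytes]],
    first_modifier: int,
) -> List[bytes]:
    """Follow NextIndex from first_modifier until it reads zero."""
    chunks: List[bytes] = []
    seen = set()
    modifier = first_modifier
    while modifier not in seen:
        seen.add(modifier)
        next_index, diag_data = get_vendor_diag(modifier)
        chunks.append(diag_data)
        if next_index == 0:
            return chunks  # last or only diagnostic data
        modifier = next_index
    raise RuntimeError("NextIndex chain loops back on itself")
```

The first modifier would normally come from bits 14-4 of PortInfo:DiagCode when the IndexForward bit is set.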



Section 14.2.5.6.1 Interpretation of DiagCode on page 640 describes the
use of the PortInfo:DiagCode forwarding mechanism used to obtain the
address modifier for the VendorDiag attribute during interpretation of
diagnostic codes. An example of the use of the IndexForward bit, bit 15 of
the PortInfo:DiagCode component, is shown in Figure 156 Index Forwarded
Diagnostic Information on page 649.

[Figure 156: a DiagCode with Error Code 4 (HARD Fail) and the IndexForward
bit set leads to a Get of VendorDiag with Attribute Modifier 5 for vendor
data. VendorDiag, AM = 5 holds NEXTINDEX = 6 and DIAGDATA; NextIndex, if
set, indicates the new Attribute Modifier (AM) to use for the next
VendorDiag "Get". VendorDiag, AM = 6 holds NEXTINDEX = 0 and DIAGDATA.]

Figure 156 Index Forwarded Diagnostic Information

In the above example the PortInfo:DiagCode with the IndexForward bit set
indicates that VendorDiag Attribute Modifier 5 of this port contains
vendor-specific diagnostic information. When VendorDiag Attribute Modifier 5
is retrieved, the VendorDiag:NextIndex value indicates more data at
Attribute Modifier 6. The retrieval of Attribute Modifier 6 returns a
VendorDiag:NextIndex of 0, indicating the end of the diagnostic data.

14.2.5.15 LedInfo

The LedInfo Attribute provides the ability to turn on or off a LED optionally
provided by a CA, router, and switch using SMPs. This LED is not specified
and the implementation of this LED is vendor-specific. It has no
association with LEDs that are specified by this or other volumes of the IB
specification. A CA, router, and switch shall indicate its support of this
attribute in the PortInfo:CapabilityMask.

Table 142 LedInfo

Component | Type | Length (bits) | Offset (bits) | Description
----------|------|---------------|---------------|------------
LedMask | RW | 1 | 0 | Set to 1 for LED on, and 0 for LED off. The response packet shall indicate actual LED state.
Reserved | RO | 31 | 1 |

14.3 Subnet Management Agent

Each CA and router, and switch will have a Subnet Management Agent
(SMA) that communicates with the SMI and SM as described in section
13.3.2 Required Managers and Agents on page 572. The SMA will respond
and generate SMPs as described in Table 110 SM MAD Sources
and Destinations on page 606. This section describes the detailed
requirements of SMA behavior where the operations defined below assume
the receipt of a valid SMP. A SMP is valid if it satisfies all applicable
validation checks as specified in Section 13.5.3 MAD Validation on page 605.

14.3.1 SubnGet

A SMA may receive a SMP from the subnet containing a SubnGet at any
time. The requester, the master SM, will fill the MADHeader:M_Key field
of the SMP header with a M_Key that matches the value of the M_Key of
the port corresponding to the receiving SMA if it expects the receiving
SMA to check it.

A SMP containing a SubnGetResp is returned according to the rules in
section 14.3.3 SubnGetResp on page 651.

14.3.2 SubnSet

An SMA may receive a SMP from the subnet containing a SubnSet at any
time. The requester, the master SM, will fill the MADHeader:M_Key field
of the SMP header with a M_Key that matches the value of the M_Key of
the port corresponding to the receiving SMA if it expects the receiving
SMA to check it.

C14-25: If the PortInfo:M_Key component is zero, the SMA shall update
the appropriate registers with the contents of the attribute contained in the
SMP.

C14-26: If the PortInfo:M_Key component is non-zero and M_Key
matching, if required, is successful according to the rules specified in
section 14.2.4 Management Key on page 623, the SMA shall update the
appropriate registers with the contents of the attribute contained in the SMP.

C14-27: The SMA shall ignore a request to change non-settable (RO)
registers. Also, the SMA shall ignore a request to change a settable register
to an illegal value.

A SMP containing a SubnGetResp is returned according to the rules in
section 14.3.3 SubnGetResp on page 651.

14.3.3 SubnGetResp

C14-28: If the PortInfo:M_Key component is zero, then the SMA shall
generate a SubnGetResp.

C14-29: If the PortInfo:M_Key component is non-zero and M_Key
matching, if required, is successful according to the rules specified in
section 14.2.4 Management Key on page 623, then the SMA shall generate
a SubnGetResp, otherwise the request is silently discarded.

C14-30: If the SMA generates a SubnGetResp, it shall fill the attribute
identified in the request with the appropriate contents of register
information.

C14-31: If the SMA generates a SubnGetResp, it shall use the
MADHeader:TransactionID obtained from the request SMP in the response
SMP.

C14-32: If the SMA generates a SubnGetResp, it shall fill the
MADHeader:M_Key in the SMP header with zero.

If the SMA generates a SubnGetResp, it should send the SMP containing
the SubnGetResp in less than PortInfo:RespTimeValue of the receiving
port, where requirements for response time are described in section
13.4.6.2 Timers and Timeouts on page 583.

After transmission of the response, the SMA discards any residual state
associated with that SMP.

14.3.4 SubnTrap

Traps may be issued by any port on the subnet. Ports that support this
mechanism will indicate this by setting the
PortInfo:CapabilityMask.IsTrapSupported bit.

o14-1: If the SMA generates a SubnTrap, it shall fill the M_Key field of the
SMP with zero.

o14-2: If the SMA generates a sequence of traps, it shall not be sent at an
interval smaller than the subnet timeout, which is specified by the
PortInfo:SubnetTimeOut component.

This mechanism is used to limit the number of traps sent on the subnet.

o14-3: If the SMA generates a trap, it shall only send it when the
PortInfo:PortState is Active.

This section describes the application of the architected traps for subnet
management event reporting. The entire list of subnet management class
traps are described in section 14.2.5.1 Notice on page 628.

14.3.5 Port State Change

Switches are capable of reporting port state changes.

o14-4: If required to send a trap or log a notice, the SMA residing on the
management port of a switch shall monitor the SwitchInfo:PortStateChange
bit for a transition from zero to one.

o14-5: If the management port supports Traps as indicated in the
PortInfo:CapabilityMask.IsTrapSupported, the SMA shall send a trap 128 to
the SM indicated by the PortInfo:MasterSMLID for a port state change on
a switch.

o14-6: If the management port supports Notices as indicated in the
PortInfo:CapabilityMask.IsNoticeSupported, the SMA shall log a notice for a
port state change on a switch.

The contents of the trap or notice are filled with information from Table 118
Notice DataDetails For Trap 128 on page 629.

14.3.6 Transport Key Mismatch

Transport key mismatch happens when a key residing in the headers of
an incoming packet does not match the key for the destination QP during
packet validation as described in the section 9.6 Packet Transport Header
Validation on page 228 of the transport chapter.

C14-33: The SMA shall monitor P_Key and Q_Key mismatches detected
by the transport services on that port.

C14-34: If a P_Key or Q_Key mismatch occurs, the SMA shall report the
current count via the contents of PortInfo:P_KeyViolations or
PortInfo:Q_KeyViolations components of the PortInfo attribute (see section
14.2.5.6 PortInfo on page 633).

o14-7: If the port supports Traps as indicated in the
PortInfo:CapabilityMask.IsTrapSupported, the SMA shall send a trap 257 or
258 to the SM indicated by the PortInfo:MasterSMLID for P_Key and Q_Key
mismatches, respectively.

o14-8: If the port supports Notices as indicated in the
PortInfo:CapabilityMask.IsNoticeSupported, the SMA shall log a notice for
P_Key and Q_Key mismatches.

The contents of the trap or notice are filled with information from Table 121
Notice DataDetails For Traps 257 and 258 on page 629 for P_Key and
Q_Key mismatches, respectively.

14.3.7 M_Key Mismatch

As a result of the M_Key residing in the SMP header, the SMA is responsible
for checking it. The SMA will compare the MADHeader:M_Key in the
SMP with the contents of the PortInfo:M_Key component of the port
where it received the SMP. If a mismatch occurs, it starts a lease period
countdown as described in section 14.2.4.2 Lease Period on page 624.

o14-9: If the port supports Traps as indicated in the
PortInfo:CapabilityMask.IsTrapSupported, the SMA shall send a trap 256 to
the SM indicated by the PortInfo:MasterSMLID when a M_Key mismatch is
detected.

o14-10: If the port supports Notices as indicated in the
PortInfo:CapabilityMask.IsNoticeSupported, the SMA shall log a notice when
a M_Key mismatch is detected.

The contents of the trap or notice are filled with information from Table 120
Notice DataDetails For Trap 256 on page 629.

14.3.8 Link Layer Errors

The link layer performs error detection and recovery as described in
section 7.12 Error detection and handling on page 185. The SMA is
responsible for monitoring the Local link integrity, excessive buffer
overrun, and flow control update counters of the link.

o14-11: If the port supports Traps as indicated in the
PortInfo:CapabilityMask.IsTrapSupported, the SMA shall send a trap 129,
130, or 131 to the SM indicated by the PortInfo:MasterSMLID when the Local
link integrity, excessive buffer overrun, and flow control update counters
increment, respectively.

o14-12: If the port supports Notices as indicated in the
PortInfo:CapabilityMask.IsNoticeSupported, the SMA shall log a notice when
the Local link integrity, excessive buffer overrun, or flow control update
counters increment.

The contents of the trap or notice are filled with information from Table 119
Notice DataDetails For Traps 129, 130 and 131 on page 629 for Local link
integrity, excessive buffer overrun, and flow control update counter
changes, respectively.

14.4 Subnet Manager

There may be one or more Subnet Managers operating on a subnet as
described in section 13.3.2 Required Managers and Agents on page 572.
Each Subnet Manager (SM) indicates its presence on the subnet by setting
the IsSM bit in the PortInfo:CapabilityMask on the port where it resides
(see Table 127 PortInfo on page 634). There may be several SMs
on a particular node, each residing on different subnets.

C14-35: An SM shall always be associated with one port and one subnet.

Each SM is always in a particular state: Master, Standby, Discovering or
Not-active.

The algorithm used to initialize the subnet, the algorithm for
adding/deleting routes in response to subnet changes, the mechanisms for
failover from master SM to standby SM, and the mechanism for transfer of
mastership from master SM to standby SM are beyond the scope of the
specification. However, there are mechanisms specified in this section that
may be used to support these operations.

C14-36: A SM shall comply with the state machine shown in Figure 157
SMInfo State Transitions on page 656 during its startup and shall become
either a master or standby SM.

Correct execution of the state machine ensures that there be only one
Master SM on a subnet at any time and that after startup, a SM becomes
either a Standby or Master on the subnet.

Furthermore, the state machine specifies how a single Master SM is
maintained during subnet topology changes, packet loss, addition/removal of
SMs, and subnet mergers. Subsequent sections include the specification
of optional mechanisms that may be used by SMs to communicate and a
description of some SM operations on the subnet, but none of these are
required for SM compliance.

14.4.1 SM State Machine

The behavior of the SM is specified in terms of the SM state machine. This
section starts by defining the specific mechanisms used by the SM: the
SMInfo attribute, control packets that SMs may exchange, a set of timers,
and the exception conditions reported to the higher layer (administrator).

Each SM will provide a SMInfo attribute that is specified in Table 140
SMInfo on page 647 and is exported from the port where it resides.



C14-37: The SMInfo:Priority, SMInfo:GUID and SMInfo:SM_Key shall be
configurable through an out-of-band mechanism that is outside the scope
of this specification.

The contents of the components in the SMInfo attribute determine which
SM in a multi-SM subnet becomes Master: the one with the highest Priority
and the lowest GUID.

Each Standby SM should be ready to become Master when the current
Master fails (or gets disconnected). Also, mastership will be handed over
when the Master detects another SM with a higher Priority (or same Priority
and lower GUID), e.g., during merger of two subnets. Handover takes
place only between SMs that have the right SM_Key.

Under certain circumstances, e.g., when the number of Standby SMs becomes
an obstacle to scalability, then a Master SM may force other SMs
to become Not-active.

C14-38: In order to assure interoperability, each SM shall respond to
SubnGet(SMInfo) or SubnSet(SMInfo) with a SubnGetResp(SMInfo).

Figure 157 SMInfo State Transitions on page 656 summarizes the states
that a SM may represent in the SMInfo:SMState.

[Figure 157: SM state transition diagram; the drawing is not recoverable
from the source.]

Figure 157 SMInfo State Transitions

The state transitions correspond to the event numbers in the text below.
The following sections describe the behavior of the SM in each of the
states and the (externally driven) events that cause state changes.

14.4.1.1 Control Packets

Control packets may be exchanged between SMs using a SMP that contains
a SubnSet(SMInfo) where the MADHeader:AttributeModifier is used
to select from one of the following actions specified in Table 143 SM
Control Packets on page 657. The SM is not required to generate these
control packets and may use mechanisms that are beyond the scope of the
specification to implement similar functions; however, a SM is required to
correctly respond to them.

Table 143 SM Control Packets

MADHeader:AttributeModifier | Description
----------------------------|------------
1 | HANDOVER: Is used to initiate the process of handing over Mastership to a higher priority Standby SM or Master.
2 | ACKNOWLEDGE: Is used to acknowledge the handover.
3 | DISABLE: Is used to disable a Standby SM.
4 | STANDBY: Is used to return a Not-active SM to Standby.
5 | DISCOVER: Causes a Standby SM to go to Discovering.
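
The control packet codes of Table 143 and the SMState values of Table 140 can be captured as enumerations. The sketch below also encodes only those state changes that Table 143 states outright (DISABLE, STANDBY and DISCOVER); the HANDOVER and ACKNOWLEDGE exchange involves more than a single state change and is deliberately left out. The type and function names are illustrative.

```python
from enum import IntEnum


class SMState(IntEnum):
    NOT_ACTIVE = 0
    DISCOVERING = 1
    STANDBY = 2
    MASTER = 3


class ControlPacket(IntEnum):
    HANDOVER = 1
    ACKNOWLEDGE = 2
    DISABLE = 3
    STANDBY = 4
    DISCOVER = 5


# Only the transitions spelled out in Table 143 are listed here.
_TRANSITIONS = {
    (SMState.STANDBY, ControlPacket.DISABLE): SMState.NOT_ACTIVE,
    (SMState.NOT_ACTIVE, ControlPacket.STANDBY): SMState.STANDBY,
    (SMState.STANDBY, ControlPacket.DISCOVER): SMState.DISCOVERING,
}


def next_state(state: SMState, packet: ControlPacket) -> SMState:
    """Return the new state, or the unchanged state if no rule applies."""
    return _TRANSITIONS.get((state, packet), state)
```

Returning the unchanged state for unlisted combinations is a choice of this example, not a statement about what the specification requires in those cases.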



14.4.1.2 Discovering State 



DISCOVERING is the initial state. 

C14-39: At startup, a SM shall enter the DISCOVERING state. 

C14-40: In the DISCOVERING state, the SM shall perform repetitive Sub- 
nGet(*) to find all nodes and SMs on the subnet. 

Section 14.4.2 Subnet Discovery Actions on page 661 summarizes many
of the attributes that are collected during discovery. The SM will typically
use direct-routed SMPs to reach all the endnodes. The sequence of
discovery is implementation specific and beyond the scope of the
specification.

C14-41: If the SM in the DISCOVERING state finds another SM with a
higher Priority than its own, or with the same Priority and a lower GUID,
or with a SMInfo:SMState = MASTER, then the SM shall yield and change
its SMInfo:SMState to STANDBY.
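
The yield rule of C14-41 can be sketched as a single predicate. This is a minimal illustration, assuming Priority and GUID are plain integers; the SMRecord type and the function name must_yield are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMRecord:
    """Hypothetical summary of one SM as seen during discovery."""
    priority: int
    guid: int
    state: str  # SMInfo:SMState, e.g. "DISCOVERING", "STANDBY", "MASTER"


def must_yield(local: SMRecord, other: SMRecord) -> bool:
    """Return True if the discovering SM `local` shall yield to `other` (C14-41)."""
    if other.state == "MASTER":
        return True
    if other.priority > local.priority:
        return True
    # Same Priority: the lower GUID wins.
    return other.priority == local.priority and other.guid < local.guid
```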

See Figure 157 SMInfo State Transitions on page 656, number 1. At this
point the SM stops the discovery and starts operating as a Standby SM.

C14-42: If the SM in the DISCOVERING state completes the discovery
process without finding a Master or a higher priority (lower GUID) SM, it
shall assume the role of a Master by changing its SMInfo:State to
MASTER.

C14-43: The master SM shall initially send to all the nodes on the subnet
SubnSet(PortInfo) SMPs with MasterSMLID and MasterSMSL that
specify a path to itself.

See Figure 157 SMInfo State Transitions on page 656, number 2.

C14-44: If the SM discovers that it does not have a M_Key required to
configure a CA, switch, or router on the subnet, it shall notify the
higher-layer (through an interface beyond the scope of the specification).

14.4.1.3 Standby State

C14-45: Standby SMs shall not configure the subnet.

C14-46: Each Standby SM shall poll the Master SM with
SubnGet(SMInfo) SMPs, addressed to its PortInfo:MasterSMLID. As long as
the Standby determines that the Master is alive, it stays in
SMInfo:SMState = STANDBY.

The minimum interval between polling is set by the higher-layer (through 
an interface beyond the scope of the specification). The actual interval 
may be longer for Standby SMs with lower Priority or when there is a 
larger number of Standby SMs on the subnet. The actual polling interval 
is installation specific and is not specified in the architecture. The Master 
may use the optional control packets to disable Standby SMs if it deter- 
mines that there is excessive polling in the subnet. 

C14-47: If the Standby SM does not receive a SubnGetResp(SMInfo) that
indicates progress in the ActCount, within the number of retries that is set
by the higher-layer (through an interface beyond the scope of the
specification), then it should conclude that the Master is no longer alive
(or accessible) and it shall change its SMInfo:SMState back to DISCOVERING.

See Figure 157 SMInfo State Transitions on page 656, number 3.
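
The liveness check of C14-47 can be sketched as follows. This is an illustration under stated assumptions: each poll result is modeled as the returned ActCount or None when no response arrived, "progress" is taken to mean any change in ActCount, and the function name master_presumed_dead is hypothetical.

```python
from typing import Iterable, Optional


def master_presumed_dead(poll_results: Iterable[Optional[int]],
                         last_act_count: int,
                         max_retries: int) -> bool:
    """Return True once `max_retries` consecutive polls show no ActCount progress.

    poll_results yields the ActCount from each SubnGetResp(SMInfo), or None
    when a poll received no response at all.
    """
    attempts_without_progress = 0
    for act_count in poll_results:
        if act_count is not None and act_count != last_act_count:
            return False  # the Master made progress, so it is alive
        attempts_without_progress += 1
        if attempts_without_progress >= max_retries:
            return True
    return False  # not enough polls yet to conclude anything
```

Comparing for inequality rather than "greater than" keeps the check valid if the counter wraps around.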

C14-48: If a Standby SM receives a DISCOVER packet, i.e., a
SubnSet(SMInfo) with MADHeader:AttributeModifier set to the value of 5,
then it shall change its SMInfo:SMState to DISCOVERING.

See Figure 157 SMInfo State Transitions on page 656, number 4.
C14-49: If a Standby SM receives a DISABLE packet, i.e., a
SubnSet(SMInfo) with MADHeader:AttributeModifier set to the value of 3,
then it shall change its SMInfo:SMState to NOT-ACTIVE.

See Figure 157 SMInfo State Transitions on page 656, number 5. This
allows the Master to disable Standby SMs if it determines that the amount
of polling creates a scaleability problem.

The Master SM may relinquish mastership of the subnet to a Standby with
higher priority and correct SM_Key if it detects one. Event 6 specifies the
Standby behavior during that transfer. The Master's behavior is specified
in section 14.4.1.5 Master State on page 660.

C14-50: If a Standby SM receives a HANDOVER control packet, i.e., a
SubnSet(SMInfo) with MADHeader:AttributeModifier set to the value of 1,
then it should perform the following sequence of steps:

1) It should have the necessary information, possibly obtaining subnet
data through the subnet administration residing with the current
Master SM.

2) It should send to all the nodes on the subnet a SubnSet(PortInfo) with
MasterSMLID and MasterSMSL that specify a path to itself.

3) It should send the Master an ACKNOWLEDGE control packet, i.e., a
SubnSet(SMInfo) with MADHeader:AttributeModifier set to the value
of 2.

4) It assumes the role of a Master by changing its SMInfo:State to
MASTER. See Figure 157 SMInfo State Transitions on page 656,
number 6.

5) If the new Master does not receive a SubnGetResp(SMInfo), it
should notify the higher-layer (through an interface beyond the scope
of the specification). This is an indication that the Master may have
died in the middle of an unsuccessful handover.

14.4.1.4 Not-Active State

C14-51: If a SM is in the NOT-ACTIVE state, it shall indicate this by setting
the SMInfo:SMState to NOT-ACTIVE.

C14-52: If the SM is in the NOT-ACTIVE state, it shall not send SubnSet()
or SubnGet() SMPs.

C14-53: If the SM is in the NOT-ACTIVE state, it shall respond to 
SubnSet(SMInfo) and SubnGet(SMInfo) SMPs. 

C14-54: If the SM is in the NOT-ACTIVE state and it receives a STANDBY
packet, i.e., a SubnSet(SMInfo) with MADHeader:AttributeModifier set to
the value of 4 (see Table 143), it shall change its state to STANDBY.

See Figure 157 SMInfo State Transitions on page 656, number 11.

14.4.1.5 Master State



The Master starts its operation by topology discovery, LID verification and 
assignment (if applicable), path verification and calculation, etc. (as spec- 
ified in section 14.4.3 Initialization Actions on page 662 ). 

C14-55: Only the Master SM shall configure subnet nodes. 

C14-56: The Master SM shall perform periodic sweeps of the subnet to 
check for changes in general, and for the appearance of new SMs, in par- 
ticular. 

C14-57: If the M_Key protection mechanism, as described in 14.2.4.1
Levels of Protection on page 624, is being used, the Master SM shall
sweep the subnet at a rate that will refresh the lease period of every port
on the subnet.

Section 14.4.5 Subnet Sweeping on page 666 describes the sweep
activities.

C14-58: The Master shall increment the SMInfo:ActCount every time it
performs a management operation or issues an SMP.

When the SM in Master state receives a valid SubnGet(SMInfo) or 
SubnSet(SMInfo), it should respond with a SubnGetResp(SMInfo) when 
M_Key matching, if required, is successful as described in section 14.4.6 
Authentication on page 667 . This is required in order to support the 
Standby polling mechanism. 



C14-59: If during the sweep the Master detects a topology change, then
it must perform the operations listed below:

• If the change is a link going down, then the Master needs to
possibly establish new paths and send new MasterSMLID/SLs to the
affected nodes. The details are beyond the scope.

• If the Master detects a new link, then it starts discovering the
subnet beyond the new links, using (partially) direct routed SMPs.

• If the SM discovers that it does not have a M_Key required to
configure a CA, switch, or router on the subnet, it will notify the
higher-layer (through an interface beyond the scope of the specification).

C14-60: If during the discovery it finds a Master with lower priority (or
same priority and higher GUID), it shall stop the discovery, waiting for the
other Master to relinquish control of its portion of the subnet.

C14-61: If during the discovery the Master finds an SM that has a higher
priority (or same priority and lower GUID) and it has the appropriate
SM_Key, it shall complete the discovery in order to determine the highest
priority SM (with an appropriate SM_Key) in the new part of the subnet (if
applicable) and it shall relinquish control of its portion of the subnet to that
SM.

The steps necessary to transfer control of the subnet from one master SM
to another are beyond the scope of the specification; however, the master SM
may use the optional control packets to perform the handover process as
follows:



• It may complete operations in progress.

• It sends the higher priority SM a HANDOVER packet, i.e., a
SubnSet(SMInfo) with MADHeader:AttributeModifier set to the
value of 1.

• It continues responding to polls from Standby SMs until it
receives an ACKNOWLEDGE packet, i.e., a SubnSet(SMInfo) with
MADHeader:AttributeModifier set to the value of 2, from the higher
priority SM.

• When it receives an ACKNOWLEDGE packet, it will change its
SMInfo:SMState to STANDBY and return a SubnGetResp(SMInfo).
See Figure 157 SMInfo State Transitions on page 656, number 9.

• If it does not receive an ACKNOWLEDGE packet, then it informs
the higher-layer (through an interface beyond the scope of the
specification).

If a Master SM discovers that a higher priority Master SM does not have the
proper SM_Key, then it should not relinquish mastership of its portion of
the subnet and it should report to the higher-layer (through an interface
beyond the scope of the specification) that it discovered another Master
SM on the same subnet.
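
The state changes described in sections 14.4.1.2 through 14.4.1.5 can be collected into a small lookup table. This is a sketch covering only the events whose numbers are cited in the text above (1 to 6, 9 and 11); the event names are hypothetical labels, and Figure 157 remains the authoritative source.

```python
# (current SMInfo:SMState, event) -> next SMInfo:SMState.
# Event labels are illustrative; comments give the Figure 157 event numbers
# cited in the text.
TRANSITIONS = {
    ("DISCOVERING", "found_better_sm"): "STANDBY",     # number 1 (C14-41)
    ("DISCOVERING", "discovery_complete"): "MASTER",   # number 2 (C14-42)
    ("STANDBY", "poll_retries_exhausted"): "DISCOVERING",  # number 3 (C14-47)
    ("STANDBY", "DISCOVER"): "DISCOVERING",            # number 4 (C14-48)
    ("STANDBY", "DISABLE"): "NOT-ACTIVE",              # number 5 (C14-49)
    ("STANDBY", "HANDOVER"): "MASTER",                 # number 6 (C14-50)
    ("MASTER", "ACKNOWLEDGE"): "STANDBY",              # number 9
    ("NOT-ACTIVE", "STANDBY"): "STANDBY",              # number 11 (C14-54)
}


def next_state(state: str, event: str) -> str:
    """Return the next state, or the unchanged state if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in a table makes it easy to check that, for example, a DISABLE packet has no effect on a Master.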

14.4.2 Subnet Discovery Actions 

The SM collects information from the attributes and records it for later
use during configuration of the subnet. The discovery algorithm is outside
the scope of the specification; however, discovery may consist of:

• probing the subnet with directed route packets

• loading a topology database from persistent storage

• a combination of information that is loaded from persistent storage
and obtained by probing subnet nodes

During discovery, the SM scans the attributes described in section 14.2.5
Attributes on page 626 to obtain information not limited to the following:

• VLs on each Port

• MTU of the Port

• Link Width on each Port

• Link Speed on each Port

• Physical topology, connectivity of links between nodes

• P_Key table sizes

• GUID table sizes

• support for various capabilities

• device type: switch, CA, or router

• power-on diagnostic status

• for switches,

    • size of switch linear-forwarding or random-forwarding tables

    • support for multicast forwarding table and size

    • presence of the optional VL arbitration table

    • presence of the optional SL-to-VL mapping table



14.4.3 Initialization Actions 



The algorithms and policy that are necessary to set many of the subnet
attributes are outside the scope of the specification. However, there is a
core set of attributes that the SM is responsible for setting in order to make
the subnet functional.

C14-62: The Master SM shall initialize the subnet components specified
in the following Table 144 Initialization on page 662.

Table 144 Initialization

Component
    Description

PortInfo:LID
    The SM shall assign a unicast LID address for each CA, switch, and
    router port on the subnet. LID usage is described in section 4.1.3
    Local Identifiers on page 114. The LID ranges assigned by a SM may
    be further limited by the Range Record described in section 15.2.5.15
    RangeRecord on page 686.

PortInfo:LMC
    The SM shall assign a LMC for each CA and router port on the
    subnet. LMC usage is described in section 4.1.3 Local Identifiers on
    page 114.

PortInfo:GidPrefix
    The SM shall assign a Subnet Prefix for the subnet based on the
    presence of a router and the rules specified in section 4.1.3 Local
    Identifiers on page 114.
PortInfo:OperationalVL
    The SM shall initialize the VL tables for CAs, switches and routers.
    The SM will examine the supported VLs in the PortInfo:VLCap at both
    ends of every link and set the maximum number of VLs by setting
    the PortInfo:OperationalVL at each end to the smaller of the two
    supported numbers of VLs. The description of VL initialization resides
    in section 7.6.7 Initialization and Configuration on page 153.

PortInfo:NeighborMTU
    The SM shall initialize the port MTU for CAs, switches and routers.
    The SM examines the supported MTU size in the PortInfo:MTUCap at
    both ends of every link and sets the maximum MTU parameter on the
    ports in the PortInfo:NeighborMTU at each end to the smaller of the
    two supported sizes. When a node powers on, it will set the MTU to
    256 bytes.

PortInfo:SubnetTimeOut
    The SM shall set the maximum trap generation rate for all nodes in
    the subnet by initializing the PortInfo:SubnetTimeOut component in all
    ports as described in section 13.4.6.2.1 PortInfo:SubnetTimeout on
    page 583.

PortInfo:RespTimeValue
    The SM shall set the default response time used to calculate
    timeouts by initializing the PortInfo:RespTimeValue component in all
    ports as described in section 13.4.6.2.2 RespTimeValue.

PortInfo:MasterSMLID
    The SM shall store the LID of the port where it resides in the
    PortInfo:MasterSMLID of each port on the subnet.

PortInfo:MasterSMSL
    The SM shall store the SL required for sending a non-SMP message
    to the SM using that LID in the PortInfo:MasterSMSL of each port on
    the subnet.

PortInfo:PortPhysicalState
    The default state on power-on is polling as described in Volume 2.

PortInfo:LinkDownDefaultState
    The default state on power-on is polling as described in Volume 2.

PortInfo:VLHighLimit
    The SM shall set the Limit of High-Priority, which limits the number
    of bytes of high-priority packets that can be transmitted, if the ports
    on both ends of a link may be operated with multiple data VLs, as
    described in section 7.6 Virtual Lanes Mechanisms on page 146.

PortInfo:M_Key
    The SM may initialize the PortInfo:M_Key for each port on the subnet
    as described in section 16.2.3.1 ClassPortInfo on page 756. The rules
    for assigning these values are outside the scope of the specification.

PortInfo:M_KeyProtectBits
    The SM may initialize the PortInfo:M_KeyProtectBits for each port on
    the subnet as described in section 14.2.4 Management Key on page
    623. The rules for assigning these values are outside the scope of the
    specification.

PortInfo:M_KeyLeasePeriod
    The SM may initialize the PortInfo:M_KeyLeasePeriod for each port
    on the subnet as described in section 14.2.4 Management Key on
    page 623. The rules for assigning these values are outside the scope
    of the specification.

PortInfo:M_KeyViolations
    The SM shall clear the PortInfo:M_KeyViolations component for all
    ports on the subnet.
PortInfo:P_KeyViolations
    The SM shall clear the PortInfo:P_KeyViolations component for all
    ports on the subnet.

PortInfo:Q_KeyViolations
    The SM shall clear the PortInfo:Q_KeyViolations component for all
    ports on the subnet.

PortInfo:VLStallCount
    The SM shall set a value for the PortInfo:VLStallCount as described in
    section 16.2.3.1 ClassPortInfo on page 756. The rules for assigning
    these values are outside the scope of the specification.

PortInfo:HOQLife
    The SM shall set a value for the PortInfo:HOQLife as described in
    section 16.3.3.1 ClassPortInfo on page 766. The rules for assigning
    these values are outside the scope of the specification.

PortInfo:DiagCode
    The SM may check the PortInfo:DiagCode of every port on the
    subnet. The rules for correcting faults detected on ports are outside
    the scope of the specification.

GUIDInfo
    The SM may assign a GUID to ports to form GIDs as described in
    section 4.1.1 GID Usage and Properties on page 110. There is one
    default GID for each port. The requirements for setting additional
    GIDs are beyond the scope of the specification.

SwitchInfo:LinearFDBTop
    On a switch that supports a linear forwarding table, the SM will
    program the highest LID to port mapping used as described in section
    14.2.5.4 SwitchInfo on page 632.

SwitchInfo:DefaultPort
    On a switch that supports a random forwarding table, the SM must set
    the default port as described in section 18.2.4.3.2 Random Forwarding
    Table Requirements on page 823. The rules for assigning these
    values are outside the scope of the specification.

SwitchInfo:DefaultMulticastPrimaryPort
    On a switch that supports multicast, the SM shall set the
    DefaultMulticastPrimaryPort as described in section 18.2.4.3.3 Required
    Multicast Relay on page 824. The rules for assigning these values are
    outside the scope of the specification.

SwitchInfo:DefaultMulticastNotPrimaryPort
    On a switch that supports multicast, the SM shall set the
    DefaultMulticastNotPrimaryPort as described in section 18.2.4.3.3
    Required Multicast Relay on page 824. The rules for assigning these
    values are outside the scope of the specification.

SwitchInfo:LifeTimeValue
    The SM shall set a value for the LifeTimeValue as described in
    section 18.2.4.4 Transmitter Queuing on page 826. The rules for
    assigning these values are outside the scope of the specification.

VLArbitrationTable
    VL arbitration described in section 7.6.9 VL Arbitration and
    Prioritization on page 154 shall be set by the SM for the output link
    of each CA, switch, and router.

SLtoVLmappingTable
    The application of VL is described in section 7.6.6 VL Mapping Within
    a Subnet on page 152. The SM will initialize the SL-to-VL mapping
    tables. The rules for assigning these values are outside the scope of
    the specification. For switches, the SM checks for the existence of the
    SLtoVLmappingTable and initializes it if present.
P_KeyTable
    The SM may initialize the P_Key table by setting entries in the
    P_KeyTable attribute for ports. It may also enable P_Key checking in
    switches. The policy for assigning P_Keys is outside the scope of the
    specification.

LinearForwardingTable
    Unicast forwarding tables will be set by the SM based on route policy
    decisions and Switch capabilities. The SM shall set up LID-to-port
    mappings if the Switch supports a Linear forwarding table as indicated
    by the SwitchInfo:LinearFDBCap component.

RandomForwardingTable
    The SM shall set up LID/LMC range to port mappings based on route
    policy decisions and Switch capabilities, if the Switch supports a
    Random forwarding table as indicated by the SwitchInfo:RandomFDBCap
    component.

MulticastForwardingTable
    The SM may set up LID to multi-port mappings based on route policy
    decisions and Switch capabilities, if the Switch supports a Multicast
    forwarding table as indicated by the SwitchInfo:MulticastFDBCap
    component.
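
The "smaller of the two" rule that Table 144 gives for PortInfo:OperationalVL and PortInfo:NeighborMTU can be sketched as below. This is only an illustration: it treats VLCap and MTUCap as plain counts and byte sizes rather than as the encoded field values, and the names PortCaps and negotiate_link are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortCaps:
    """Hypothetical view of one end of a link."""
    vl_cap: int   # supported number of VLs (from PortInfo:VLCap)
    mtu_cap: int  # supported MTU in bytes (from PortInfo:MTUCap)


def negotiate_link(end_a: PortCaps, end_b: PortCaps) -> dict:
    """Pick the values the SM would write to both ends of the link."""
    return {
        "OperationalVL": min(end_a.vl_cap, end_b.vl_cap),
        "NeighborMTU": min(end_a.mtu_cap, end_b.mtu_cap),
    }
```

For example, a link between a port supporting 8 VLs and 2048 bytes and a port supporting 4 VLs and 4096 bytes would be operated with 4 VLs and a 2048 byte MTU.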



14.4.4 Port State Transitions 



When power is applied to a device, its ports attempt to reach an
operational state according to the steps described in Volume 2, Chapter 5 and
section 6.2 Services provided by the Physical Layer, on page 130. A
physical subnet is established when a group of devices are connected together
and the state of a set of ports reaches operational state.

C14-63: A SM shall determine that a subnet is operational when the
PortInfo:PortState on the port where it resides is at the initialize state.

The SM may access the management entities of remote CAs, switches,
and routers while the ports along the physical links are in initialize state
since the SMI on that port will recognize a packet on QP0 and VL15, with
a LID destination address 0xFFFF, as referring to the SMA.

The SM may change the state of a port to active, armed, initialize, or down,
as described in section 14.4.4 Port State Transitions on page 665.

The SM may perform most port and device configuration activities while
the PortInfo:PortState is in the initialize state. However, all control and
configuration options are also available in the armed state and the active
state. In addition to the link level behaviors, the PortInfo:PortState has an
additional role because it is manipulated by the SM to communicate to
endnodes the readiness of the subnet. A CA or router may start sending
packets on the subnet once one of its ports enters the active state. As a
result, moving a port from active state is likely to be disruptive to subnet
activity.
An SM that becomes the master SM may enable transmission of packets
through the subnet at any time. This is accomplished after establishing
routes by setting the switch forwarding tables and initializing the other
attributes as described in section 14.4.3 Initialization Actions on page 662
for CAs, switches, and routers along those routes, and then setting the
PortInfo:PortState to armed for the ports along those routes. The SM
changes the state of an Endnode from armed to active to signal to the
Endnode that it may begin to send packets. Ports on switches along
that route and endnodes that are destinations of those packets will
transition from armed to active automatically as described in section 14.4.4
Port State Transitions on page 665.



The SM may reset port related state by:

1) setting the PortInfo:LinkDownDefaultState to polling

2) setting the PortInfo:PortState to the down state.

The PortInfo:PortState should return to the initialize state after clearing its
state as described by the link state machine in Figure 50 Link State
Machine on page 136.

14.4.5 Subnet Sweeping

C14-64: After the subnet is up and running, the SM shall periodically
gather information about topology changes, PortInfo:CapabilityMask
changes, and Notices reported by nodes.

This is referred to as sweeping the subnet. The frequency of subnet
sweeps is undefined for this architecture, as it will vary due to topology
and other implementation considerations.

The SM detects topology changes by examining the port state of nodes in
the subnet. For example, when the value of the PortInfo:PortState
component of a port changes from down to initialize, the SM will use directed
routed packets to probe the other end of the link on that port to determine
what has been added to the subnet. Conversely, if the PortInfo:PortState
component changes from active to down, the SM may perform operations
such as updating switch forwarding tables to delete routes to the
endnode(s) that are no longer accessible. To speed up detection of port state
changes, switches support a SwitchInfo:PortStateChange component,
described in Table 124 SwitchInfo on page 632, that the SM may examine.
If the state of this component indicates that the state of one of the switch
ports has changed, the SM may proceed to check the status of each port
on that switch.
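
The two port state changes used as examples above can be sketched as a comparison between two sweeps. This is a minimal illustration, assuming each sweep is held as a mapping from a port identifier to its PortInfo:PortState name; the function name classify_port_changes and the action labels are hypothetical.

```python
def classify_port_changes(previous: dict, current: dict) -> dict:
    """Compare two sweeps and list the ports that need follow-up work."""
    actions = {"probe": [], "remove_routes": []}
    for port, new_state in current.items():
        old_state = previous.get(port)
        if old_state == "down" and new_state == "initialize":
            # Something was attached: probe the far end with directed route packets.
            actions["probe"].append(port)
        elif old_state == "active" and new_state == "down":
            # Endnodes behind this port are gone: routes may be deleted.
            actions["remove_routes"].append(port)
    return actions
```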
14.4.6 Authentication

During initialization of a SMP, the SM may fill in the MADHeader:M_Key
field of the SMP with the value that matches the M_Key stored in the
destination port if it expects the destination management entity to check it.

C14-65: The SM shall not check the MADHeader:M_Key stored in a
SubnGetResp(*).

C14-66: If the SM receives a SMP containing a SubnSet(SMInfo) or Sub- 
nGet(SMInfo) and the Portlnfo:M_Key component is zero, then the SM 
shall generate a SubnGetResp. 

C14-67: If the SM receives a SMP containing a SubnSet(SMInfo) or
SubnGet(SMInfo), and the PortInfo:M_Key component is non-zero, and
M_Key matching, if required, is successful according to the rules specified
in section 14.2.4 Management Key on page 623, then the SM shall
generate a SubnGetResp. Otherwise the SubnSet(SMInfo) or
SubnGet(SMInfo) is silently discarded.
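
The decision in C14-66 and C14-67 can be sketched as follows. This is a simplified illustration: the real matching rules are those of section 14.2.4, which are not reproduced here, so the sketch assumes that matching reduces to an equality test whenever a check is required. The function and parameter names are hypothetical.

```python
def should_respond(port_m_key: int, mad_m_key: int, check_required: bool = True) -> bool:
    """Decide whether a SubnGet/SubnSet(SMInfo) earns a SubnGetResp.

    Returns False when the SMP should be silently discarded.
    """
    if port_m_key == 0:
        return True  # C14-66: no M_Key configured on the port
    if not check_required:
        return True  # C14-67: matching is only performed "if required"
    # Assumption: a required match is a plain equality test.
    return mad_m_key == port_m_key
```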

C14-68: When a Master SM receives a SMP containing a SubnTrap(), it
shall not check that the MADHeader:M_Key field matches the
PortInfo:M_Key of the port where the SMP was received.

The SMInfo:SM_Key is used by the Master SM to authenticate other
standby SMs and master SMs, in the case of a subnet merge, on the
subnet. Exactly how the key is used is implementation specific. A SM
should fill the SM_Key in the SMInfo:SM_Key component in a response if
it expects the requesting SM to check it.

14.4.7 SM Disable Mechanism 

C14-69: If a SM can reside on a port, a vendor-defined, out-of-band
mechanism shall be provided that, when asserted, will disable the capability
of running a SM from that port, and the state of the mechanism shall be
indicated in the PortInfo:CapabilityMask.IsSMdisable bit.

C14-70: When the PortInfo:CapabilityMask.IsSMdisable bit is asserted,
the port behavior shall be:

• SubnSet(SMInfo) or SubnGet(SMInfo) sent to that port shall be
discarded

• SubnSet(*) or SubnGet(*) shall not be sent from that port

• The PortInfo:CapabilityMask.IsSM bit for that port shall not be set.

C14-71: When PortInfo:CapabilityMask.IsSMdisable is not asserted, the
port behavior shall be:

• SubnSet(SMInfo) or SubnGet(SMInfo) sent to that port will be
forwarded to management entities if the appropriate entity is operational

• SubnSet(*) or SubnGet(*) may be sent by management entities from
the port

• The PortInfo:CapabilityMask.IsSM bit is controlled by management
entities behind that port

The mechanism for changing the state of the
PortInfo:CapabilityMask.IsSMdisable bit is beyond the scope of the specification.

C14-72: The state of the PortInfo:CapabilityMask.IsSMdisable on a port
shall be changeable at any time while the port is operational.

Changing the state of PortInfo:CapabilityMask.IsSMdisable from asserted
to not-asserted while the port is otherwise operational may cause a SM to
start up, but that is beyond the scope of the specification.
Chapter 15: Subnet Administration

This chapter defines IBA Subnet Administration (SA) communication and
the function of that communication: the MADs used, and the functions
they are associated with.

15.1.1 SA Function

Through the use of Subnet Administration class MADs, SA provides
access to and storage of information of several types, some optional.

C15-2: The information that shall be provided by SA is specified in Table
149 Subnet Administration Attributes (Summary) on page 679.

The types of information involved are:

• Information that endnodes require for operation in a subnet. Such
information includes paths between endnodes, notification of events,
service records, etc. This information is required.

• Information that is typically non-algorithmic, i.e., information that
cannot be recovered algorithmically by inspection of the network after a
power-on or initialization event. Such information includes partitioning
data, M_Keys, SL to VL mappings, etc. This information is required to
allow off-line migration from one vendor's subnet management
implementation to another's. This is required.

• Information that may be useful to other management entities such as
standby SMs, who may, for example, wish to use it to maintain
synchronization with the master SM. Such information includes subnet
topology data, switch forwarding tables, etc. This is optional.

In order to perform these functions, SA includes three functions:
• A reliable multipacket data transfer protocol, required because the
information sent to or retrieved by SA in many cases is larger than will
fit into a single MAD unreliable datagram

• A query subsystem required to identify the information to be sent and
received

• An event-forwarding subsystem that forwards SM-received traps and
notices to subscribed parties.

The actual SA implementation is outside the scope of the architecture. The
actual access QP and DLID may be redirected by the GSI.



15.1.2 Relationship Between SA and the SM

Much, but not all, of the information provided by SA is created or collected
by the SM. SA must therefore have a close relationship with the master
SM. That relationship is defined as follows:

• SA is part of the SM. Its functions are discussed separately from the
SM only for convenience of description. This descriptive convenience
is not intended to imply or require any particular implementation
organization of the SM (or SA) by any vendor.

• As is the case for any class of IB management, SA functions may be
implemented on a host Endnode separate from the one holding the
SM; whether this is done is vendor-specific. If any SM function is
implemented at a location different from the one identified as holding
the SM, including but not limited to SA functions, any or all
communication between that function and any other SM elements is
vendor-specific.

C15-3: Should an SM be elected master SM, all its components must also
be implicitly elected master, including but not limited to SA, however they
may be implemented. If an SM ceases to be master, all of its components,
including but not limited to SA, must cease responding to messages from
client nodes.

15.1.3 Overview

The remainder of this chapter first defines the MADs used by SA. It also
defines the reliable multipacket transport protocol, and then the operation
of SA. The SA operations described include locating the SA, and the SA
methods and their operation. Also described are identification of
information records, access restrictions that must be implemented, versioning,
and event forwarding.

15.2 SA MADs

This section defines the MADs sent and received by SA.
15.2.1 SA MAD Format

C15-4: The SA MADs must use the Generic Services Interface (GSI) and
adhere to all GSI rules of use. Like all MADs, they must conform to MAD
use as specified in 13.4 Management Datagrams on page 573.

Table 145 on page 672 shows the Subnet Administration datagram format
that will be used, and Table 146 Subnet Administration Fields on page 672
defines the fields contained in the Subnet Administration datagram.

Table 145 Subnet Administration Format

Bytes      Contents
0-23       Standard MAD Header (see Figure 136 on page 575)
24-31      SA_KEY
32-39      SM_KEY
40-43      Segment Number
44-47      Payload Length
48         Fragment Flag
49         Edit Modifier
50-51      Window
52-55      EndRID
56-63      ComponentMask
64-255     Admin Data (192 bytes)



Table 146 Subnet Administration Fields

Field             Length      Description
SA_KEY            64 bits     Subnet Administration key value. If 0, no prior admin
                              queries performed. Ignored by non-query methods.
SM_KEY            64 bits     Subnet Manager verification key. Refer to Chapter 14.
Segment Number    32 bits     Segment number of a segmented Subnet Admin packet.
Payload Length    32 bits     The number of valid data bytes in the data stream, if a
                              multi-packet data sequence, in the first packet of a
                              request or a response. Refer to 15.3.1.2 Payload
                              Length on page 697.
FragmentFlag      8 bits      Refer to 15.3.1.3 Fragment Flag on page 697.
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Table 146 Subnet Administration Fields 



Edit Modifier 


8 bits 


An enumeration with the following values: 

\J rVJKJ tcUUlU illUUIMcr 

1 - Delete record modifier 

2 - Edit record modifier 

3 to 255 - Reserved 


Window 


16 bits 


For multlpacket operations. Refer to Windowing 15.3,1.4 


End RID 


32 bits 


Used exclusively for indicating the ending Record identifier (RID) for table 
query and manipulation operations. Set to zeroes for all other operations. 


ComponentMask 


64 bits 


Used to indicate attribute components to be used for query and edit opera- 
tions. Bit 0 maps to first attribute, bit 1 the second attribute, and so forth. A 
bit set to one indicates attribute component is used to form query or edit 
operation, otherwise field is to be ignored. 


Subnet Admin 
Data 


1536 bits/ 
192 bytes 


Data field where attribute content is stored. 
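
The byte offsets in Table 145 and the field lengths in Table 146 can be sketched as a fixed 256-byte layout. The following is a minimal illustration only, assuming big-endian (network) byte order and treating the standard MAD header as an opaque 24-byte blob; the names `SA_MAD`, `FIELDS` and `parse_sa_mad` are hypothetical and not part of the specification.

```python
import struct

# Assumed big-endian layout following the byte offsets of Table 145:
# 24-byte MAD header, SA_KEY (8), SM_KEY (8), Segment Number (4),
# Payload Length (4), Fragment Flag (1), Edit Modifier (1), Window (2),
# EndRID (4), ComponentMask (8), Admin Data (192).
SA_MAD = struct.Struct(">24sQQIIBBHIQ192s")

# Hypothetical field names, in the same order as the format string above.
FIELDS = (
    "mad_header", "sa_key", "sm_key", "segment_number", "payload_length",
    "fragment_flag", "edit_modifier", "window", "end_rid",
    "component_mask", "admin_data",
)


def parse_sa_mad(datagram: bytes) -> dict:
    """Split a 256-byte SA datagram into the fields named in Table 146."""
    if len(datagram) != SA_MAD.size:
        raise ValueError(f"expected {SA_MAD.size} bytes, got {len(datagram)}")
    return dict(zip(FIELDS, SA_MAD.unpack(datagram)))
```

The format string sums to 256 bytes, which matches the last offset shown in Table 145 (64 + 192 = 256).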



15.2.1.1 SA-Specific ClassPortInfo:CapabilityMask Bits 



The Subnet Administration class uses two class-specific bits of the 
ClassPortInfo:CapabilityMask: 

• bit 8 is defined and named "IsSubnetOptionalRecordsSupported" 

• bit 9 is defined and named "IsUDMulticastSupported." 

C15-5: If IsSubnetOptionalRecordsSupported=1, SA must support all 
records listed as optional in Table 149 Subnet Administration Attributes 
(Summary) on page 679 except for MCGroupRecord and MCMemberRecord, 
and all the methods listed as optional in Table 147 Subnet Administration 
Methods on page 674. This bit must not be used to indicate support for 
only some of those records and methods. If 
IsSubnetOptionalRecordsSupported=0, SA does not support those records 
and methods. 

C15-6: If IsUDMulticastSupported=1, SA must support MCGroupRecord 
and MCMemberRecord as listed in Table 149 Subnet Administration 
Attributes (Summary) on page 679. If IsUDMulticastSupported=0, SA does 
not support those records. 

See 13.4.8.1 ClassPortInfo on page 589 for a description of the 
ClassPortInfo:CapabilityMask. 
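
As an illustration of the two class-specific bits described above, the following sketch tests bits 8 and 9 of a CapabilityMask value. It assumes bit 0 is the least significant bit, which this section does not state; the constant and function names are hypothetical.

```python
# Assumption: bit 0 is the least significant bit of the CapabilityMask.
IS_SUBNET_OPTIONAL_RECORDS_SUPPORTED = 1 << 8
IS_UD_MULTICAST_SUPPORTED = 1 << 9


def sa_capabilities(capability_mask: int) -> dict:
    """Report which optional SA record groups a CapabilityMask advertises."""
    return {
        # Bit 8: all optional records and methods except the multicast records.
        "optional_records": bool(
            capability_mask & IS_SUBNET_OPTIONAL_RECORDS_SUPPORTED
        ),
        # Bit 9: MCGroupRecord and MCMemberRecord.
        "ud_multicast": bool(capability_mask & IS_UD_MULTICAST_SUPPORTED),
    }
```

Because each bit covers a whole group (all or none), a requester needs only these two tests to decide which optional records may be queried.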



15.2.2 Summary of Methods 



Table 147 Subnet Administration Methods on page 674 summarizes the 
methods provided by the Subnet Administration class. Several of these 
are common methods described in 13.4.5 Management Class Methods 
on page 577; some are unique to this class. Subnet Administration 
methods are described in more detail in 15.4 Operations on page 706. 

Those methods which use the Multipacket Reliable Transport Protocol 
(see section 15.3 on page 696) have "Yes" in the "Multi-Packet" column. 

C15-7: SA must support all the methods listed as required in Table 147 on 
page 674. 

o15-1: SA may support the methods listed as optional in that table; either 
all such methods must be supported, or none. 

Table 147 Subnet Administration Methods 

Method Type             Value  Multi-   Optional or  Description
                               Packet?  Required
SubnAdmGet()            0x01   No       Required     Request a get (read) of an attribute from a node.
SubnAdmGetResp()        0x81   No       Required     The response from an attribute get or set request.
SubnAdmSet()            0x02   No       Required     Request a set (write) of an attribute in a node.
                                                     The object will issue a SubnAdmGetResp() as a
                                                     response.
SubnAdmInform()         0x10   No       Required     Request an event subscription.
SubnAdmInformResp()     0x90   No       Required     Reply to an event subscription request.
SubnAdmReport()         0x11   No       Required     Forward an event previously subscribed for.
SubnAdmReportResp()     0x91   No       Required     Reply to a SubnAdmReport() method.
SubnAdmGetTable()       0x12   No       Required     Subnet Manager table request.
SubnAdmGetTableResp()   0x92   Yes      Required     Subnet Manager table request response.
SubnAdmGetBulk()        0x13   No       Optional     Dump Subnet Manager data request.
SubnAdmGetBulkResp()    0x93   Yes      Optional     Dump Subnet Manager data response.
SubnAdmConfig()         0x15   Yes      Required     Request to configure.
SubnAdmConfigResp()     0x95   Yes      Required     Response to configuration request.
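
The method values in Table 147 follow a visible pattern: each response value equals its request value with bit 0x80 set, except that SubnAdmSet() is answered by SubnAdmGetResp(). The sketch below captures that observation; the enum and helper names are hypothetical, and the 0x80 rule is read off the table rather than quoted from it.

```python
from enum import IntEnum


class SaMethod(IntEnum):
    """Method values as listed in Table 147 (member names are illustrative)."""
    GET = 0x01
    SET = 0x02
    INFORM = 0x10
    REPORT = 0x11
    GET_TABLE = 0x12
    GET_BULK = 0x13
    CONFIG = 0x15
    GET_RESP = 0x81
    INFORM_RESP = 0x90
    REPORT_RESP = 0x91
    GET_TABLE_RESP = 0x92
    GET_BULK_RESP = 0x93
    CONFIG_RESP = 0x95


RESPONSE_BIT = 0x80  # observed in the table: response = request | 0x80

# Methods marked "Yes" in the Multi-Packet column.
MULTI_PACKET = frozenset({
    SaMethod.GET_TABLE_RESP, SaMethod.GET_BULK_RESP,
    SaMethod.CONFIG, SaMethod.CONFIG_RESP,
})


def expected_response(request: SaMethod) -> SaMethod:
    """Return the response method that answers the given request method."""
    if request & RESPONSE_BIT:
        raise ValueError(f"{request.name} is already a response")
    if request == SaMethod.SET:
        return SaMethod.GET_RESP  # a set is answered with SubnAdmGetResp()
    return SaMethod(request | RESPONSE_BIT)
```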



15.2.3 Subnet Administration Status Values 

Table 148 Administration MAD Status Field Bit Values 

Name               Bit    Meaning
Common bit values  0-7    See 13.4.7 Status Field on page 587
ERR_KEY_STALE      8      Supplied key is stale, need to reload table records/get
                          new lease.
ERR_REQ_INVALID    9      Supplied request or update is invalid.
                   10-11  Reserved






InfiniBand™ Trade Association 

Page 674 

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067 

InfiniBand™ Architecture Release 1.0 
Volume 1 - General Specifications 

Subnet Administration 

October 24, 2000 
FINAL 



Table 148 Administration MAD Status Field Bit Values (continued) 

Name           Bit    Meaning
ERR_REFUSED    12     Policy violation
RESERVED2      13-15  Reserved for future use.
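
The class-specific status bits of Table 148 can be decoded with a small lookup. This is a sketch only: it assumes bit 0 is the least significant bit of the 16-bit status field and ignores the common bits 0-7, which are defined in 13.4.7; the names are hypothetical.

```python
# Class-specific status bits from Table 148 (bit number -> name).
# Assumption: bit 0 is the least significant bit of the status field.
SA_STATUS_BITS = {
    8: "ERR_KEY_STALE",
    9: "ERR_REQ_INVALID",
    12: "ERR_REFUSED",
}


def decode_sa_status(status: int) -> list:
    """Return the names of the SA-specific error bits set in a status word."""
    return [
        name
        for bit, name in sorted(SA_STATUS_BITS.items())
        if status & (1 << bit)
    ]
```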



15.2.4 Attributes and Record Attributes 

The contents of the Admin Data field of SA MADs are, in effect, data 
records of various types. Since the attribute field of the standard header 
identifies Admin Data formats and semantics, the term Record Attribute 
(RA) is used to refer to these attribute types. 

RAs may be spoken of as logically stored in SA; while some must actually 
be stored, whether others are actually stored or are computed in response 
to a query is implementation-dependent. 

Three ways exist for RAs to be stored by SA: 

• Attributes are captured and deposited by the Master SM in the 
course of routing and sweeping the subnet. 

• Attributes are stored as a result of traps that are captured and 
logged by the Master SM. 

• Attributes are logged by administrative software updating subnet 
configuration data. Some of these may, ultimately, have been en- 
tered by a customer; for example, partition records recording cus- 
tomer's partitioning configuration input. 

15.2.4.1 Record Attributes (RA) 

To distinguish them, RAs use a naming convention: They are all called 
XXXRecord, where XXX is the name by which the data is known. When 
the data is an SM attribute the XXX is the SM attribute name (see 14.2.5 
Attributes on page 626). For example, the NodeInfo subnet management 
attribute is reflected in the RA NodeInfoRecord. Other RAs have no 
counterpart in SM attributes, but use this convention anyway; PathRecords 
and MCGroupRecords are examples. 

Every RA has identifiers associated with it called Record Identifiers 
(RIDs), using the layout illustrated in Figure 158 on page 676. Each RA 
always has its own RID (as illustrated), and in some cases there is also a 
RID that indicates relations to other RAs (not illustrated). RIDs are defined 
in 15.2.4.3 on page 677; they serve several purposes: 

• They allow query and modification of the attribute itself 

• They allow local relational organization of obtained data, if 
desired (this is implementation dependent) 

• They allow further queries of SA using these RIDs 

Figure 158 Record Attribute 

As a response to a query method, the RAs are organized into tables when 
multiple RAs are returned. 




15.2.4.1.1 State Record RAs 



State records are RAs that are available for query only. They 
cannot be edited. These records reflect dynamic topology data constantly 
being used and manipulated by the SM. Examples of these kinds of RAs 
are the PortInfoRecord and NodeRecord. 

15.2.4.1.2 Configuration Record RAs 

Configuration record RAs are not normally updated by the SA. 
ServiceRecord (a type of configuration record RA; see Section 15.2.5.14 
on page 684) is an exception, since it is deleted when the service lease 
expires. Examples of configuration records are the ServiceRecord and the 
InformRecord. These RAs are subject to edit by management applications 
through the administrative interface. 

A change in a configuration record implies possible activity by the SM to 
bring a subnet into compliance with the new RA. In some cases this is un- 
necessary; for example, changes to ServiceRecords imply no change 
other than to provide these new RAs on later queries. Other changes to 
configuration records could have more dramatic effects. For example, 
changes to the RangeRecords may cause the master SM to update range 
records on standby SMs to bring the standbys into compliance with the 
master SM (this is implementation dependent). 
15.2.4.2 Record Tables and Bulk Records 

Collections of RAs transferred by communications with subnet 
administration are called tables. Each table consists of a single type of 
RA, such as NodeRecords. 

A collection of all the tables together is referred to as bulk. 

15.2.4.3 Record Identifiers (RIDs) 

All RAs have at least a single RID. This RID is used to provide a RA with 
a specific unique identifier for query and edit operations. Most RAs have 
a single RID, referred to as a major RID, which has the following format: 

LID (16 bits) | PortNumber (8 bits) | Enumeration (8 bits) 

Figure 159 General RID model 

The LID will be the base LID of the port in question. 

For switches the PortNumber is also used to specify the port to which a 
RA is related. It is not used for channel adapters, and is set to 0. 

The enumeration is used where more than one, but fewer than 256, RAs 
are related to a specific port. 

For example, several GUIDInfo RAs for a CA port with a LID of 5 will use 
the LID portion of the RID to refer to the CA port's base LID, set the 
PortNumber portion of the RID to 0 (as it is unused for CAs), and use the 
enumeration value of 1 to refer to the first GuidInfoRecord RA related to 
LID 5, the enumeration value of 2 to refer to the second GuidInfoRecord 
RA related to LID 5, and so on. 
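
The General RID example above can be sketched as simple bit packing. The field widths (16, 8 and 8 bits) come from Figure 159; placing the LID in the most significant bits is an assumption based on the left-to-right order of that figure, and the function names are hypothetical.

```python
def pack_general_rid(lid: int, port_number: int, enumeration: int) -> int:
    """Pack LID (16 bits), PortNumber (8 bits), Enumeration (8 bits).

    Assumption: the LID occupies the most significant 16 bits.
    """
    if not (0 <= lid <= 0xFFFF):
        raise ValueError("LID must fit in 16 bits")
    if not (0 <= port_number <= 0xFF):
        raise ValueError("PortNumber must fit in 8 bits")
    if not (0 <= enumeration <= 0xFF):
        raise ValueError("Enumeration must fit in 8 bits")
    return (lid << 16) | (port_number << 8) | enumeration


def unpack_general_rid(rid: int) -> tuple:
    """Split a 32-bit General RID into (LID, PortNumber, Enumeration)."""
    return (rid >> 16) & 0xFFFF, (rid >> 8) & 0xFF, rid & 0xFF


# The GuidInfoRecord example: CA port with base LID 5, PortNumber 0,
# enumerations 1 and 2 for the first and second records.
first_guid_rid = pack_general_rid(5, 0, 1)
second_guid_rid = pack_general_rid(5, 0, 2)
```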

RAs that cannot be cataloged with the general RID model are instead 
identified with the unique RID, which is as follows: 

Unique RID (32 bits) | Related NodeRecord RID (LID | PortNumber | Enumeration) 

Figure 160 Unique RID model 



The Unique RID, another type of major RID, is any unique value for a RA 
coupled with a secondary RID relating the RA to a specific NodeRecord. 
In this way all RAs that have a determined relation to a given port can be 
related back to it. This second RID is referred to as the related RID. 



15.2.4.3.1 Special RID Usage 

SLtoVLMappingTableRecord uses the General RID, using the PortNumber 
to indicate the input port and the enumeration to indicate the output port 
this record references. 

VLArbitrationTableRecord uses the General RID, where the PortNumber 
corresponds to the port number of the switch this attribute is assigned to, 
and the enumeration is used to specify the part of the table assigned, as 
defined in Chapter 14. 

LinearForwardingTableRecords use the General RID model; however, 
they use the PortNumber and Enumeration fields as a single 16-bit integer 
field instead of as two separate fields to enumerate up to 767 possible 
records for a switch. 

RandomForwardingTableRecords use the General RID model; however, 
they use the PortNumber and Enumeration fields as a single 16-bit integer 
field instead of as two separate fields to enumerate up to 3071 possible 
records for a switch. 

MulticastForwardingTableRecords use the General RID model, but 
again the PortNumber and Enumeration fields are viewed as a single 
value to specify the 14 meaningful bits of the MulticastForwardingTable 
assigned to a switch; the two low-order bits are not used and are set to 0. 

PartitionRecords use the General RID to refer to the CA, switch or 
router this attribute is assigned to, and also use the PortNumber to specify 
the port this record is assigned to. The Unique RID is a unique value 
assigned by SA to the P_KeyTableRecord. 

15.2.4.4 LID Aliasing 

LID assignment can change in the subnet due to a variety of conditions. 
Ports that have multiple LIDs assigned to them by use of the LMC can be 
addressed by the base LID, or by any LID value that could be valid for that 
RA by exercise of the LMC. The ability to use any of the values of the LID 
supported for an object by use of the LMC is called LID aliasing. 

15.2.5 Attributes 

This section first provides a summary of all the SA attributes, and then lists 
the data format of each, with descriptions where warranted. 



15.2.5.1 Summary of Attributes 



C15-8: SA must process all the attributes listed as required in Table 149 
on page 679, which summarizes the SA attributes. 

o15-2: SA may also process the attributes listed as optional in Table 149 
on page 679. These form two groups, one for UD Multicast, and the other 
for other optional records. Either all of a group must be processed, or none 
of that group. See 15.2.1.1 SA-Specific ClassPortInfo:CapabilityMask Bits 
on page 673 for the definition of those groups. 

RAs are noted in that table, along with the Subnet Management attribute 
they contain; see also 15.2.1.1 SA-Specific ClassPortInfo:CapabilityMask 
Bits on page 673. 

Configuration and State records are noted in Table 149, along with the 
Subnet Management attribute they contain. 

Table 149 Subnet Administration Attributes (Summary) 

Attribute Name                  Container for              Attribute  Required/  Record      Config-  State  Description
                                (if applicable)            ID         Optional   Attributes  uration
ClassPortInfo                   N/A                        0x0001     R          n           n        n      Class information; see 13.4.4
Notice                          N/A                        0x0002     R          n           n        n      Notice information; see 13.4.8.2
InformInfo                      N/A                        0x0003     R          n           n        n      Subscription (Inform) Information; see 13.4.8.3
NodeRecord                      NodeInfo & NodeDescription 0x0011     R          Y           n        S      NodeInfo record
PortInfoRecord                  PortInfo                   0x0012     R          Y           n        S      PortInfo record
SLtoVLMappingTableRecord        SLtoVLMappingTable         0x0013     R          Y           n        S      SLtoVLMappingTable record
SwitchRecord                    SwitchInfo                 0x0014     O          Y           n        S      SwitchInfo record
LinearForwardingTableRecord     LinearForwardingTable      0x0015     O          Y           n        S      Linear Forwarding database entry records
RandomForwardingTableRecord     RandomForwardingTable      0x0016     O          Y           n        S      Random forwarding database entry records
MulticastForwardingTableRecord  MulticastForwardingTable   0x0017     O          Y           n        S      Multicast Forwarding database entry records
SMInfoRecord                    SMInfo                     0x0018     O          Y           n        S      SMInfo record
InformRecord                    InformInfo                 0x00F3     O          Y           C        S      InformInfo record
NoticeRecord                    Notice                     0x00F4     O          Y           n        S      Notice or trap record
LinkRecord                      N/A                        0x0020     O          Y           n        S      Link record
GuidInfoRecord                  GUIDInfo                   0x0030     O          Y           n        S      GIDs assigned to a port
ServiceRecord                   N/A                        0x0031     R          Y           C        S      Service advertisement record
PartitionRecord                 PartitionInfo              0x0033     R          Y           n        S      Partition records
RangeRecord                     N/A                        0x0034     R          Y           C        S      Range records
PathRecord                      N/A                        0x0035     R          Y           n        S      Subnet path information
VLArbitrationRecord             VLArbitrationTable         0x0036     R          Y           n        S      VL arbitration record
MCGroupRecord                   N/A                        0x0037     R          Y           C        S      Multicast group records
MCMemberRecord                  N/A                        0x0038     R          Y           C        S      Multicast member record
SAResponse                      N/A                        0x8001     O          n           n        n      Container for subnet query response



Table 150 Subnet Administration Attribute / Method Map on page 680 
associates SA attributes with methods. 

SA must allow use of all the specified methods with each attribute. 
Table 150 Subnet Administration Attribute / Method Map 

Attribute                        Get  Set  Inform  Report  GetTable  GetBulk
ClassPortInfo                    X    -    -       -       -         -
Notice                           -    -    -       X       -         -
NodeRecord                       X    -    -       -       X         X
PortInfoRecord                   X    -    -       -       X         X
SLtoVLMappingTableRecord         X    -    -       -       X         X
SwitchRecord                     X    -    -       -       X         X
LinearForwardingTableRecord      -    -    -       -       X         X
RandomForwardingTableRecord      -    -    -       -       X         X
MulticastForwardingTableRecord   -    -    -       -       X         X
VLArbitrationRecord              -    -    -       -       X         X
SMInfoRecord                     X    -    -       -       X         X
InformRecord                     X    X    -       -       X         X
NoticeRecord                     X    -    -       -       X         X
LinkRecord                       X    -    -       -       X         X
GuidInfoRecord                   X    -    -       -       X         X
ServiceRecord                    X    X    -       -       X         X
RouterRecord                     X    -    -       -       X         X
PartitionRecord                  X    -    -       -       X         X
RangeRecord                      X    X    -       -       X         X
PathRecord                       X    -    -       -       X         -
MCGroupRecord                    X    X    -       -       X         X
MCMemberRecord                   X    X    -       -       X         X
SAResponse                       -    -    -       -       -         X



The detailed layouts of all the SA-specific records follow. RAs which are 
containers for Subnet Management records require little description 
beyond their layout; others have more extensive descriptions. 







Table 151 NodeRecord 

Component        Length(bits)  Offset(bits)  Description
NodeRID          32            0             The node's RID, used as an RID record
NodeInfo         320           32            NodeInfo Record contents
NodeDescription  512           352           NodeDescription Record contents


15.2.5.2 PortInfoRecord 

Table 152 PortInfoRecord 

Component  Length(bits)  Offset(bits)  Description
NodeRID    32            0             RID of this record
PortInfo   432           32            PortInfo Attributes record
15.2.5.3 SLtoVLInfoRecord 

Table 153 SLtoVLMappingTableRecord 

Component    Length(bits)  Offset(bits)  Description
SLVLRID      32            0             Unique RID of this record
NodeRID      32            32            Node RID this record is referencing
SLVLMapping  64            64            SLtoVLMapping attribute


15.2.5.4 SwitchRecord 

Table 154 SwitchRecord 

Component   Length(bits)  Offset(bits)  Description
NodeRID     32            0             RID of this record
SwitchInfo  132           32            Contents of SwitchInfo Attribute



15.2.5.5 LinearFdbRecord 

Table 155 LinearFdbRecord 

Component      Length(bits)  Offset(bits)  Description
LinearFdbRID   32            0             RID of this forwarding record
NodeRID        32            32            Reference to related node
LinearFdbInfo  512           64            Contents of Linear forwarding DB


15.2.5.6 RandomFdbRecord 

Table 156 RandomFdbRecord 

Component     Length(bits)  Offset(bits)  Description
RandomFdbRID  32            0             RID of this forwarding record
NodeRID       32            32            RID reference to node
RandomFdb     512           64            Contents of Random Forwarding Table
15.2.5.7 MulticastForwardingRecord 

Table 157 MulticastForwardingRecord 

Component                 Length(bits)  Offset(bits)  Description
McastRID                  32            0             RID of this forwarding record
NodeRID                   32            32            RID reference to node.
MulticastForwardingTable  512           64            Contents of Multicast Forwarding Table



15.2.5.8 VLArbitrationRecord 

Table 158 VLArbitrationRecord 

Component      Length(bits)  Offset(bits)  Description
VLArbRID       32            0             RID of this forwarding record
NodeRID        32            32            RID reference to node.
VLArbitration  512           64            Contents of VLArbitration Attribute


15.2.5.9 SMInfoRecord 

Table 159 SMInfoRecord 

Component  Length(bits)  Offset(bits)  Description
NodeRID    32            0             RID of related NodeInfo record
SMInfo     168           32            Contents of SMInfo Attributes of given SM


15.2.5.10 PartitionRecord 

Table 160 PartitionRecord 

Component     Length(bits)  Offset(bits)  Description
PartitionRID  32            0             Unique RID of this record
NodeRID       32            32            RID of related NodeInfo record
P_KeyTable    512           64            Contents of P_KeyTable attribute
15.2.5.11 InformRecord 

Table 161 InformRecord 

Component  Length(bits)  Offset(bits)  Description
InformRID  32            0             RID of this record
NodeRID    32            32            RID of related NodeInfo record
Inform     344           64            Content of Inform attribute record


15.2.5.12 NoticeRecord 

Table 162 NoticeRecord 

Component  Length(bits)  Offset(bits)  Description
NoticeRID  32            0             RID of this notice/trap record.
NodeRID    32            32            RID of related NodeInfo record
Notice     512           64            Content of Notice attribute record


15.2.5.13 LinkRecord 

Table 163 LinkRecord 

Component  Length(bits)  Offset(bits)  Description
LinkRID    32            0             RID of this record
FromLID    16            32            From InfiniBand address [should this be GUID]
ToLID      16            48            To InfiniBand address
FromPort   8             64            From port number
ToPort     8             76            To port number

Link records are synthesized by the subnet administration to serve as 
informational topology data for management entities in need of such data. 



15.2.5.14 ServiceRecord 

Table 164 ServiceRecord 

Component             Length(bits)  Offset(bits)  Description
ServiceRID            32            0             RID of this service record
ServiceLease          32            32            Lease period remaining for this service, in seconds.
                                                  0xffffffff is an indefinite lease.
Partition             16            64            Partition of this Service
ServiceSpecificFlags  12            80            Information related to this service. Content is
                                                  service specific.
ServiceGenericFlags   4             92            Generic information related to this service. The
                                                  interpretation of individual bits in the field is
                                                  as follows:
                                                  Bit 0: Indirection, set if the service provider may
                                                  redirect requests;
                                                  Bit 1: DHCP-capable, set if a DHCP server or a
                                                  Directory Agent may automatically register this
                                                  service by querying the SA;
                                                  Bit 2: Reserved.
                                                  Bit 3: Reserved.
ServiceName           992           96            Null-terminated name of the service
ServiceGID            320           1088          Text representation (compliant with IPv6
                                                  conventions (a)) of the port GID for the service,
                                                  null-terminated
ServiceID             128           1408          String of 16 hexadecimal digits, including any
                                                  leading zeros, not null-terminated

a. Hinden, R. and Deering, S., RFC 2373: IP Version 6 Addressing Architecture, July 1998 (Section 2.2 
of the document describes rules for text representation of IPv6 addresses.) 



Service records serve the purpose of first level or "bootstrap" 
advertisement of basic services that cannot be found prior to query of the 
SA. These could be services such as boot services, or name or directory 
services. 

ServiceRecords are not intended to do more than to provide a first level 
directory to other applications and services normally associated with a 
network. If more than one ServiceLocation is associated with a 
ServiceName, there are multiple ServiceRecords, one for each 
ServiceLocation. 

15.2.5.14.1 ServiceName 

ServiceRecord:ServiceName is an ASCII string identifying what service is 
being sought (for example "tftp", "CFM.IBTA", "sendmail", and so on). This 
is a 124-byte long, null-terminated string. 
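
A ServiceName field of this kind could be filled and read back as sketched below. The 124-byte length and the null terminator come from the text above; padding the unused tail with NUL bytes is an assumption, and the helper names are hypothetical.

```python
SERVICE_NAME_BYTES = 124  # 992 bits, as given for ServiceName


def encode_service_name(name: str) -> bytes:
    """Encode an ASCII service name into a fixed 124-byte field.

    Assumption: the unused tail of the field is padded with NUL bytes.
    """
    raw = name.encode("ascii")
    if len(raw) >= SERVICE_NAME_BYTES:
        # At least one byte must remain for the null terminator.
        raise ValueError("service name too long for a null-terminated field")
    return raw.ljust(SERVICE_NAME_BYTES, b"\0")


def decode_service_name(field: bytes) -> str:
    """Return the text up to (not including) the first NUL byte."""
    return field.split(b"\0", 1)[0].decode("ascii")
```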
15.2.5.15 RangeRecord 

Table 165 RangeRecord 

Component   Length(bits)  Offset(bits)  Description
RangeRID    32            0             RID of the range record
NodeRID     32            32            RID of related NodeInfo record
FmAssigned  64            64            GUID of SM allotted range
FromRange   16            128           Value of beginning of range
ToRange     16            144           Value of end of range



Range records specify ranges of LIDs. They exist to allow avoidance of 
LID conflicts in some cases. A Master SM can use them to provide ranges 
of LIDs to standby SMs, thereby enabling the standby SMs to use known 
unique ranges if a subnet they control is independently initialized. 
SubnAdmConfig() can be used to "push" these from master to standby, and 
Get operations can be used by standbys to get ranges from the master. 

15.2.5.16 PathRecord 



Table 166 PathRecord 

Component               Length  Offset  Required For  Description
                        (bits)  (bits)  GetTable
                                        Request
PathRID                 32      0                     RID of this PathRecord
DGID                    128     32                    Destination GID to establish path to
SGID                    128     160     X             Source GID to establish path from
DLID                    16      288                   Destination LID
SLID                    16      304                   Source LID
RawTraffic              1       320                   Raw Packet path
                                                      0 - IB Packet (P_Key must be valid)
                                                      1 - Raw Packet traffic (No P_Key)
Reserved3               3       321                   Reserved (Ignored)
FlowLabel               20      324                   FlowLabel (to be used in the GRH if GRH used)
HopLimit                8       344                   Hop limit (to be used in the GRH if GRH used)
TClass                  8       352                   TClass (to be used in the GRH if GRH used)
Reserved1               1       360                   Reserved (Ignored)
NumbPath                7       361     X             Maximum number of paths to return (or be
                                                      returned). If more paths exist, the paths
                                                      returned meet the requirements in the GetTable
                                                      request, but are limited to this number of
                                                      entries (implementation dependent)
P_Key                   16      368                   Partition Key for this path
SL                      16      384                   Service level - bit significant (MSB 16...LSB 0)
MtuSelector             2       400                   0 - greater than or equal to MTU specified
                                                      1 - less than or equal to MTU specified
                                                      2 - exactly the MTU specified
Mtu                     6       402                   Enumeration of the MTU required:
                                                      1: 256
                                                      2: 512
                                                      3: 1024
                                                      4: 2048
                                                      5: 4096
                                                      6-63: reserved
RateSelector            2       408                   0 - greater than or equal to rate specified
                                                      1 - less than or equal to rate specified
                                                      2 - exactly the rate specified
Rate                    6       410                   Enumeration of the rate:
                                                      1: 1 Gb/sec.
                                                      2: 2.5 Gb/sec.
                                                      3: 10 Gb/sec.
                                                      4: 30 Gb/sec.
PacketLifeTimeSelector  2       416                   0 - greater than or equal to PacketLifeTime
                                                      specified
                                                      1 - less than or equal to PacketLifeTime
                                                      specified
                                                      2 - exactly the PacketLifeTime specified
PacketLifeTime          6       418                   Accumulated packet life time for the path,
                                                      specified by an enumeration and expressed as
                                                      4.096 microseconds * 2^PacketLifeTime



The PathRecord RA is used to request routing information between end- 
nodes. Its results are required to create connections and perform other 
tasks. The data returned in a PathRecord is usually generated, based on 
the routing algorithm used by the SM. SA may issue a secondary redirect 
to another service to respond to the request. Redirected requests are ser- 
viced using the same class requests and responses. 
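
The selector and enumeration fields of Table 166 can be interpreted as sketched below. The MTU values, the meaning of selector values 0 to 2, and the 4.096 microsecond base unit are taken from the table; the function and constant names are hypothetical, and treating selector value 3 as an error is an assumption.

```python
# Mtu enumeration from Table 166 (enumeration -> MTU value).
MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def selector_matches(selector: int, offered, requested) -> bool:
    """Apply an MtuSelector / RateSelector / PacketLifeTimeSelector value.

    0 - offered must be greater than or equal to the requested value
    1 - offered must be less than or equal to the requested value
    2 - offered must be exactly the requested value
    """
    if selector == 0:
        return offered >= requested
    if selector == 1:
        return offered <= requested
    if selector == 2:
        return offered == requested
    raise ValueError(f"unsupported selector value {selector}")


def packet_life_time_us(enumeration: int) -> float:
    """Convert a 6-bit PacketLifeTime enumeration to microseconds."""
    if not 0 <= enumeration <= 63:
        raise ValueError("PacketLifeTime is a 6-bit field")
    return 4.096 * (2 ** enumeration)
```

For instance, a path offering MTU enumeration 3 (1024) satisfies a request with MtuSelector 1 and Mtu 3, and a PacketLifeTime of 10 corresponds to roughly 4.19 milliseconds.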
PathRecords can use the Administration Query Subsystem (15.4.5 on 
page 708) to request paths with desired properties. Using the component 
mask, the requester can build a GetTable request and supply the known 
fields in the record; the reply from SA will supply the response entries 
which match the request. For example, by setting the Component Mask 
used to cause everything but the SGID and DGID to be ignored, a 
SubnAdmGetTable() will return PathRecords for all paths from the SLID to 
the DLID. By selectively specifying the qualities desired, a path with any 
given qualities can be requested. 
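
A component mask for such a PathRecord query might be built as in the following sketch. It assumes that mask bit 0 (taken as the least significant bit) corresponds to the first PathRecord component in table order and that the reserved fields each occupy a bit position; neither detail is confirmed by this section, and the reserved-field names used here are illustrative.

```python
# PathRecord components in table order. Assumption: mask bit i selects
# the i-th entry, and reserved fields still consume a bit position.
PATH_RECORD_COMPONENTS = [
    "PathRID", "DGID", "SGID", "DLID", "SLID", "RawTraffic", "Reserved3",
    "FlowLabel", "HopLimit", "TClass", "Reserved1", "NumbPath", "P_Key",
    "SL", "MtuSelector", "Mtu", "RateSelector", "Rate",
    "PacketLifeTimeSelector", "PacketLifeTime",
]


def component_mask(*names: str) -> int:
    """Return a mask with one bit set for each named component."""
    mask = 0
    for name in names:
        mask |= 1 << PATH_RECORD_COMPONENTS.index(name)
    return mask


# Match on SGID and DGID only; every other component is ignored.
sgid_dgid_mask = component_mask("SGID", "DGID")
```

Under these assumptions the SGID/DGID query uses a mask of 0x6, and a mask that leaves the destination components clear expresses a wildcard destination.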

Normally the DGID is known (or learned at some point, e.g. from a name 
service). But during a "boot" sequence, it may be useful to leave it 
unspecified, thus returning paths to all endnodes reachable from an SGID. 

C15-9: SA shall provide a wildcard PathRecord query such that when the 
DLID and DGID are both specified as component mask entries of 0 in a 
query, and the SLID is that of the requester, that query shall return a single 
path record to each reachable port. Which path is returned for each 
reachable port is indeterminate. 

This can be used as the equivalent of an operating system's "bus walk" that 
finds all reachable devices. Note that 15.4.1 Restrictions on Access on 
page 706 requires that the only paths returned are to devices which are 
visible under the partitioning arrangements in force. 

Table 167 on page 688 is an example of the MAD header of a request for 
PathRecords (the data field is shown in the subsequent table): 

Table 167 Example PathRecord Request MAD Header 

Component                  Value       Description
Header:Method              0x12        SubnAdmGetTable
Header:MAD status                      Set to zero (ignored)
Header:AttributeID         0x00038     Specifying the PathRecord
Header:Attribute Modifier  0xffffffff  Specifying a Query request matching the record in the
                                       data field
Header:EndRID              0x0000      Set to 0, ignored by SA
TransactionID              0x11223344  Transaction ID (to be returned in the response)
PayLoad Length             0x75        1 * sizeof (PathRecord) + Header = 1*0x35+0x40
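
The PayLoad Length arithmetic in the request header can be checked with a short calculation. The two sizes (0x35 for a PathRecord and 0x40 for the header) are taken from the example; the constant and function names are hypothetical. The record size also agrees with Table 166, whose last component starts at bit offset 418 and is 6 bits long, giving 424 bits or 53 (0x35) bytes.

```python
PATH_RECORD_SIZE = 0x35  # bytes per PathRecord, as used in the example
HEADER_SIZE = 0x40       # header bytes counted in the payload length


def payload_length(num_records: int) -> int:
    """Payload length for a packet carrying num_records PathRecords."""
    return num_records * PATH_RECORD_SIZE + HEADER_SIZE


# Cross-check against Table 166: the last field ends at bit 418 + 6 = 424.
assert (418 + 6) // 8 == PATH_RECORD_SIZE

request_payload_length = payload_length(1)  # 0x75 for the one-record request
```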
Table 168 Example PathRecord Request on page 689 shows what the 
data field in the request would look like if up to two paths are requested to 
access a service. The path specifications being requested are: 

• the MTU is no larger than 1024, 
• the rate must be at 2.5 Gb/sec, 
• the path must be in the partition which the requester has access to, 
• for any service level, 
• for any Packet Life Time cost, 
• for any TClass, 
• for raw or IB traffic, 
• any flow label, 
• or any number of "hops" 

Table 168 Example PathRecord Request 

Component               Value              Meaning/Implication
PathRID                                    Query any record
DGID                    Service global ID  GID of the service to communicate with (typically
                                           from a name service)
SGID                    Local global ID    Source GID address
DLID                                       Indicates to the Subnet Administration that the path
                                           may be to any port on the destination node.
SLID                                       Indicates to the SA that the path may be from any
                                           port on the local node.
RawTraffic                                 IB traffic (P_Key will be valid)
Reserved3                                  ignored - set to 0
FlowLabel                                  FlowLabel (to be used in the GRH if GRH used)
HopLimit                                   Hop limit (to be used in the GRH if GRH used)
TClass                                     TClass (to be used in the GRH if GRH used)
Reserved1                                  ignored - set to 0
NumbPath                                   Requesting only 2 paths which meet these
                                           specifications
P_Key                   0x1234             P_Key to be used for this path
SL                                         SL for this path (bit 3 means SL3)
MtuSelector                                Paths must have an MTU of 1024 or less
Mtu
RateSelector                               Path rate must be exactly 2.5Gb only
Rate
PacketLifeTimeSelector                     Return paths with any PacketLifeTime
PacketLifeTime



10 

For this example the following is a possible resulting response header and 1 1 
the PathRecords found in the data field of the response:: 12 

13 

Table 169 Example PathRecord Response MAD Header

Component                  Value        Meaning/Implication
Header:Method              0x92         SubnAdmGetTableResp
Header:MAD status                       Good status
Header:AttributeID         0x00038      Specifying the PathRecord
Header:Attribute Modifier  0x0001       PathRecord RID being returned starts with 1
Header:EndRID              0x0002       Number of records in this packet
TransactionID              0x11223344   Same as in request
PayLoad Length             0xA7         2 * sizeof(PathRecord) + Header = 2*0x35 + 0x40

The following are the records in the data field of the resulting response:

Table 170 Example PathRecord Response

Component                Value              Meaning/Implication
PathRID                  0x01               RID of this Path record
DGID                     Service global ID  GID of the service to communicate with
                                            (typically from a name service)
SGID                     Local global ID    Source GID address
DLID                     0x0008             The LID assigned to the port where this
                                            service can be reached
Table 170 Example PathRecord Response (continued)

Component                Value              Meaning/Implication
SLID                     0x000A             The LID assigned to the port where this
                                            service can be accessed for the SGID
RawTraffic               0                  IB traffic (P_Key will be valid)
Reserved3                0                  ignored - set to 0
FlowLabel                0                  Default FlowLabel since this is an
                                            intra-subnet DGID
HopLimit                 0                  Default HopLimit since this is an
                                            intra-subnet DGID
TClass                   0                  Default TClass since this is an
                                            intra-subnet DGID
Reserved1                0                  ignored - set to 0
NumbPath                 2                  2 paths are being returned
P_Key                    0x1234             P_Key to be used for this path
SL                       8                  This path has a SL of 3 (bit 3)
MtuSelector              2                  Path MTU is exactly 1024
Mtu                      3
RateSelector             2                  Path rate is exactly 2.5Gb
Rate                     2
PacketLifeTimeSelector   2                  Path PacketLifeTime was not specified
PacketLifeTime           0
PathRID                  0x03               RID of this record
DGID                     Service global ID  GID of the service to communicate with
                                            (typically from a name service)
SGID                     Local global ID    Source GID address
DLID                     0x0009             The LID assigned to the port where this
                                            service can be reached
SLID                     0x000A             The LID assigned to the port where this
                                            service can be accessed for the SGID
RawTraffic               0                  IB traffic (P_Key will be valid)
Reserved3                                   ignored - set to 0
FlowLabel                                   Default FlowLabel since this is an
                                            intra-subnet DGID
HopLimit                                    Default HopLimit since this is an
                                            intra-subnet DGID
TClass                                      Default TClass since this is an
                                            intra-subnet DGID
Reserved1                                   ignored - set to 0
Table 170 Example PathRecord Response (continued)

Component                Value              Meaning/Implication
NumbPath                                    2 paths are being returned
P_Key                    0x1234             P_Key to be used for this path
SL                                          This path has a SL of 3 (bit 3)
MtuSelector                                 Path MTU is exactly 512
Mtu
RateSelector                                Path rate is exactly 2.5Gb
Rate
PacketLifeTimeSelector                      Path PacketLifeTime is
                                            4.096usec * 2^10 = 4.2msec
PacketLifeTime           0x0A
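
The PacketLifeTime arithmetic in the last row can be checked with a short sketch. The function name `packet_lifetime_seconds` is hypothetical and is not defined by the specification; the formula is the one quoted in the table (4.096 usec multiplied by 2 raised to the encoded value).

```python
def packet_lifetime_seconds(encoded):
    """Decode a PacketLifeTime value as 4.096 usec * 2**encoded (per Table 170)."""
    return 4.096e-6 * (2 ** encoded)


# The example record carries PacketLifeTime = 0x0A.
lifetime = packet_lifetime_seconds(0x0A)
print(f"{lifetime * 1e3:.1f} msec")  # prints "4.2 msec"
```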



15.2.5.17 MCGroupRecord 



Table 171 MCGroupRecord

Component    Length(bits)  Offset(bits)  Description
McGroupRID   32            0             RID of this record supplied by Subnet
                                         Administration in the response to the add
                                         (create).
                                         add request: value ignored by SA
                                         delete request: RID returned on the create
MGID         128           32            Multicast GID for this multicast group
                                         add request: if zero, the subnet admin
                                         allocates an available MGID, else uses the
                                         specified MGID.
Q_Key        16            160           Q_Key supplied in the request
                                         add request: non-zero
MLID         16            176           Multicast LID for this multicast group
                                         add request: the response provides the MLID
MTU          8             192           MTU of this multicast group (For a create
                                         must be specified and zero is invalid. For a
                                         delete, zero matches all records.)
                                         0: reserved
                                         1: 256
                                         2: 512
                                         3: 1024
                                         4: 2048
                                         5: 4096
                                         6-255: reserved
TClass       8             200           TClass to be used in the GRH if GRH is used;
                                         specified on a create and distributed to the
                                         member record on a successful join.
P_Key        16            208           Partition Key for this multicast group (must
                                         be specified)
RawTraffic   1             224           Traffic will be raw packets (No P_Key)
                                         0 - IBA packet traffic (P_Key must be valid)
                                         1 - Raw packet traffic
Reserved3    3             225           Reserved (ignored)
FlowLabel    20            228           Flow label to be used in the GRH if GRH is
                                         used; specified on a create and distributed
                                         to the member record on a successful join.
HopLimit     8             248           Hop limit to be used in the GRH if GRH is
                                         used; specified on a create and distributed
                                         to the member record on a successful join.
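
As an illustration of the layout in Table 171, the following minimal sketch packs and unpacks the 256-bit record. The helper names `pack_record` and `unpack_record` are hypothetical, and the sketch assumes that offset 0 is the most significant bit of the first byte; the table itself does not state the bit ordering.

```python
# Field layout copied from Table 171: (name, length in bits, offset in bits).
MC_GROUP_FIELDS = [
    ("McGroupRID", 32, 0),
    ("MGID", 128, 32),
    ("Q_Key", 16, 160),
    ("MLID", 16, 176),
    ("MTU", 8, 192),
    ("TClass", 8, 200),
    ("P_Key", 16, 208),
    ("RawTraffic", 1, 224),
    ("Reserved3", 3, 225),
    ("FlowLabel", 20, 228),
    ("HopLimit", 8, 248),
]
RECORD_BITS = 256


def pack_record(values):
    """Pack a dict of field values into 32 bytes; missing fields are zero."""
    word = 0
    for name, length, offset in MC_GROUP_FIELDS:
        value = values.get(name, 0)
        if value < 0 or value >= (1 << length):
            raise ValueError(f"{name} does not fit in {length} bits")
        # Assumption: offset 0 is the most significant bit of the record.
        shift = RECORD_BITS - offset - length
        word |= value << shift
    return word.to_bytes(RECORD_BITS // 8, "big")


def unpack_record(data):
    """Inverse of pack_record: return a dict with every field of Table 171."""
    word = int.from_bytes(data, "big")
    fields = {}
    for name, length, offset in MC_GROUP_FIELDS:
        shift = RECORD_BITS - offset - length
        fields[name] = (word >> shift) & ((1 << length) - 1)
    return fields


# An add request: Q_Key, MTU and P_Key set, all other fields zero.
add_request = pack_record({"Q_Key": 0x1234, "MTU": 3, "P_Key": 0x1234})
```

Driving both functions from one field list keeps the offsets in a single place, so a corrected offset changes packing and unpacking together.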



When an entity wishes to create a multicast group, it can be done with
either the SubnAdmConfig() or the SubnAdmSet() methods to create a
MCGroupRecord.

o15-3: Using the SubnAdmSet() method, the edit modifier (see Table 105
Attributes Common to Multiple Classes on page 588) must be set up
accordingly: set to add for creating a multicast group and delete for
removing a multicast group. One cannot edit (or modify) a group.

See section 15.4.9 SubnAdmConfig() & SubnAdmConfigResp() - Add, modify or
delete RAs on page 713 for specifics on using the SubnAdmConfig() method
to allocate/delete a MCGroupRecord.

A multicast group can be created by the SubnAdmSet() method, specifying
Add Record in the edit modifier, the Q_Key, MTU and the P_Key (all other
fields are zero). If a particular MGID is required, it can be specified
in the SubnAdmSet() as well. A multicast group can be deleted by the
SubnAdmSet() method, specifying Delete Record in the edit modifier and
the McGroupRID. Alternatively, the request can specify the MGID, Q_Key,
MLID, MTU and P_Key. If a field that may be zero is set to zero for a
delete, Subnet Administration treats that field as matching all
MCGroupRecords.

o15-4: The addition of a new record or removal of a MCGroupRecord implies
that the SM shall program routers and switches with the new multicast
information.
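
The zero-matches-all rule for delete requests can be sketched as follows. This is only an illustration under stated assumptions: records are modeled as plain dictionaries, and the function names `matches_delete` and `delete_matching` are hypothetical rather than part of the specification.

```python
def matches_delete(record, template):
    """True when every non-zero field of `template` equals the record's field.

    A zero in the template acts as a wildcard for that field.
    """
    return all(value == 0 or record.get(name) == value
               for name, value in template.items())


def delete_matching(records, template):
    """Return (kept, removed) after applying a delete template to `records`."""
    removed = [r for r in records if matches_delete(r, template)]
    kept = [r for r in records if not matches_delete(r, template)]
    return kept, removed


groups = [
    {"McGroupRID": 1, "Q_Key": 0x1111, "MTU": 3, "P_Key": 0x1234},
    {"McGroupRID": 2, "Q_Key": 0x2222, "MTU": 3, "P_Key": 0x1234},
    {"McGroupRID": 3, "Q_Key": 0x3333, "MTU": 4, "P_Key": 0x5678},
]
# MTU and P_Key are given; Q_Key is zero and therefore matches any value.
kept, removed = delete_matching(groups, {"Q_Key": 0, "MTU": 3, "P_Key": 0x1234})
```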



15.2.5.18 MCMemberRecord 



Table 172 MCMemberRecord

Component    Length(bits)  Offset(bits)  Description
MCMemberRID  32            0             RID of this record
                                         Zero specified in the leave request results
                                         in a match for all records
MGID         128           32            Multicast GID address for this multicast
                                         group. Required in the request and returned
                                         in the response.
Q_Key        16            160           Q_Key is supplied at Multicast Group creation
                                         time by the creator.
                                         Returned in the response.
MLID         16            176           Multicast LID, assigned by the SM at creation
                                         time.
                                         Zero for a join request. Ignored by Subnet
                                         administration.
                                         Returned in the response for a join/leave.
LLID         16            192           LID of requester
                                         Returned in the response for a join/leave.
TClass       16            208           TClass to be used in the GRH if GRH is used;
                                         specified in the group record and distributed
                                         to the member record on a successful join.
                                         0 - unspecified for a query (matches any)
RawTraffic   1             224           Traffic will be raw packets (No P_Key)
                                         0 - IBA packet traffic (P_Key must be valid)
                                         1 - Raw packet traffic
Reserved3    3             225           Reserved (ignored)
FlowLabel    20            228           Flow label to be used in the GRH if GRH is
                                         used; specified in the group record and
                                         distributed to the member record on a
                                         successful join.
                                         0 - unspecified for a query (matches any)
HopLimit     8             248           Hop limit to be used in the GRH if GRH is
                                         used; specified in the group record and
                                         distributed to the member record on a
                                         successful join.
                                         0 - unspecified for a query (matches any)
P_Key        16            256           Partition key is supplied at Multicast Group
                                         creation time by the creator.
                                         Non-zero in the request. Checked by Subnet
                                         administration.
                                         (Note: Multicast groups can't span partitions
                                         with a single MLID)



When an entity wishes to join a multicast group, it can be done with
either the SubnAdmConfig() or the SubnAdmSet() methods to create a
MCMemberRecord.

When using the SubnAdmSet() method, the edit modifier (see Table 105
Attributes Common to Multiple Classes on page 588) must be set up
accordingly: add for joining a multicast group and delete for leaving a
multicast group.

o15-5: SA shall respond to a SubnAdmSet() method for a MCMemberRecord
that has the edit modifier set to edit with a SubnAdmGetResp() with the
status set to invalid attribute.

A multicast group can be joined using the SubnAdmSet() method by
specifying Add Record in the edit modifier and the MGID (all other fields
are zero). See section 15.4.9 SubnAdmConfig() & SubnAdmConfigResp() -
Add, modify or delete RAs on page 713 for details on using the
SubnAdmConfig() method for joining a multicast group.

When leaving a multicast group, the SubnAdmSet() method can also be used
by specifying Delete Record in the edit modifier and the MCMemberRID.
Alternatively, one can specify the LLID, MGID, Q_Key, MLID, MTU and
P_Key. If a field is set to zero for a delete request, Subnet
Administration treats that field as matching all MCMemberRecords. (Note:
One could leave all the multicast groups with one SubnAdmSet() request by
specifying just the LLID.)

o15-6: An addition of a new MCMemberRecord or removal of a MCMemberRecord
implies that the SM shall program routers and switches with the new
multicast information.
15.2.5.19 GuidInfoRecord

Table 173 GuidInfoRecord

Component    Length(bits)  Offset(bits)  Description
GuidInfoRID  32            0             RID of this record
NodeRID      32            32            RID of related NodeInfo record
GUIDInfo     512           64            Content of GUIDInfo attribute record

15.2.5.20 SAResponse

Table 174 SAResponse

Component    Length(bits)      Offset(bits)  Description
ByteCount    32                0             Size of data transfer in bytes
Checksum     32                32            Checksum of data transfer
CurrentKey   32                64            Current SA key
TableRID     variable, each    96            A tuple of Attribute Identifier and
             entry is 48 bits                offset RID for the given attribute
                                             record. The RID is delimited with an
                                             attribute modifier and offset RID tuple
                                             set to 0xffffffffffff.
BulkData     0 to limit        varies        Bulk data

The SAResponse record is utilized by SubnAdmGetBulkResp() exclusively in
the first packet response to indicate contents of the following
datastream from the SA.

15.3 Reliable Multi-Packet Transaction Protocol

C15-10: The multiple-packet transaction protocol specified in this
section, 15.3 Reliable Multi-Packet Transaction Protocol on page 696,
shall be used for SubnAdmGetBulk(), SubnAdmGetTable(), and
SubnAdmConfig() transactions, whenever their request or response spans
multiple packets.

15.3.1 Subnet Administration MAD Data Field Usage

The following fields of the Subnet Administration MAD data field are used
to support the multi-packet protocol.

15.3.1.1 Segment Number Field

The Segment Number field identifies the relative position of each packet
within a multipacket request or response. Segment Numbers for multipacket
requests and responses begin at segment number 1; the segment number of a
single-packet request or single-packet response is set to 0.

For a description of the usage of the segment number field in
acknowledgment packets, resend request packets, and KeepAlive packets,
see 15.3.1.3 Fragment Flag on page 697.

15.3.1.2 Payload Length

The payload length field is valid only for the first packet of a
multipacket SubnAdmConfig() request, and the first and last packet of
multi-packet SubnAdmGetBulk() and SubnAdmGetTable() responses. In all
other packets of a multipacket request or response, the payload length
field is reserved.

In the first packet of a SubnAdmConfig() request, the value in the
payload length field indicates the sum of the lengths in bytes of the
Admin Data fields in all packets which the requester is sending for the
transaction.

In the first packet of a multi-packet SubnAdmGetBulk(),
SubnAdmGetTable() or SubnAdmConfig() response, the payload length field
indicates the expected sum of the lengths of the Admin Data fields in all
packets of the entire multipacket response.

In the last packet of a multipacket SubnAdmGetBulk(), SubnAdmGetTable()
or SubnAdmConfig() response, the payload length field indicates the
number of valid bytes in the Admin Data field.

After the first response to a SubnAdmGetBulk() or SubnAdmGetTable()
request is sent, the actual payload length may change. The payload length
field in the last response packet indicates the number of valid bytes in
the Admin Data field, and the Last Packet bit (bit 2) of the fragment
flag is set to one.

15.3.1.3 Fragment Flag

The fragment flag identifies the packet as either the first packet of a
request or response, a midstream packet, or the last packet. The fragment
flag is also used to specify other characteristics of the packet as
indicated in Table 175 on page 697. For single-packet request/response
transactions, the fragment flag is set to zero.

Table 175 Fragment Flag Description

Bit  Name                    Description
0    First Packet            Set to one in the first packet of a multi-packet
                             request and the first packet of a multi-packet
                             response; set to zero otherwise.
1    Reserved
2    Last Packet             Set to one in the last packet of a multi-packet
                             request and the last packet of a multi-packet
                             response; set to zero otherwise.
3    Resend Request          For a requester, this bit is set to one to request
                             the responder to restart sending packets for a
                             multi-packet response beginning with the Segment
                             Number indicated in the Segment Number field. For a
                             responder, this bit is set to one to restart sending
                             packets of a multi-packet request beginning with the
                             Segment Number indicated in the Segment Number field.
                             In all other cases, bit 3 is set to zero. For resend
                             request packets, all fields in the MAD data field
                             other than the fragment flag and segment number are
                             ignored by the recipient.
4    Reserved
5    Acknowledgment Packet   For a requester, this bit is set to one to
                             acknowledge the receipt of all response packets up to
                             and including the packet with the segment number
                             indicated in the Segment Number field. For a
                             responder, this bit is set to one to acknowledge the
                             receipt of all request packets up to and including
                             the packet with the segment number indicated in the
                             Segment Number field. After receipt of a packet is
                             acknowledged, resources associated with the packet
                             are released and a request to retransmit the packet
                             is not allowed. (See bit 3 above.) The segment number
                             intervals at which acknowledgment packets are sent
                             are not specified. For acknowledgment packets, all
                             fields in the MAD data field other than the segment
                             number, fragment flag, and window are ignored by the
                             recipient.
6    Reserved
7    KeepAlive Packet        Set to one by the sender of a multipacket request or
                             response to request the recipient to re-initialize
                             the timer for the transaction to the segment timeout
                             period. The sender may be either the requester, as in
                             the case of a multipacket request, or the responder,
                             as in the case of a multipacket response. For the
                             KeepAlive packet, all fields in the MAD data field
                             other than the fragment field are ignored by the
                             recipient.
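
The fragment flag bit numbering can be illustrated with a short sketch. It assumes that bit 0 is the leftmost (most significant) bit of the byte, which is what the b'10000000' pattern for a first packet implies. The constant and function names are hypothetical, and the KeepAlive value is derived from the bit position rather than quoted from the specification.

```python
def flag_bit(bit_number):
    """Return the byte value with only `bit_number` set (bit 0 = leftmost)."""
    if not 0 <= bit_number <= 7:
        raise ValueError("fragment flag bits are numbered 0-7")
    return 1 << (7 - bit_number)


FIRST_PACKET = flag_bit(0)     # b'10000000'
LAST_PACKET = flag_bit(2)      # b'00100000'
RESEND_REQUEST = flag_bit(3)   # b'00010000'
ACKNOWLEDGMENT = flag_bit(5)   # b'00000100'
KEEPALIVE = flag_bit(7)        # derived: b'00000001'


def describe(flag):
    """List the names of the non-reserved bits set in a fragment flag byte."""
    names = {
        FIRST_PACKET: "First Packet",
        LAST_PACKET: "Last Packet",
        RESEND_REQUEST: "Resend Request",
        ACKNOWLEDGMENT: "Acknowledgment Packet",
        KEEPALIVE: "KeepAlive Packet",
    }
    return [name for mask, name in names.items() if flag & mask]


print(format(ACKNOWLEDGMENT, "08b"))  # prints "00000100"
```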



Table 176 on page 698 gives examples of the use of the fragment flag.

Table 176 Fragment Flag Usage Examples

Type of Transaction      Packet                                    bits 0-7
Single Packet Request    only packet                               00000000
Multi-packet Request     first packet                              10000000
                         nth (not last) packet                     00000000
                         last packet                               00100000
                         Acknowledgment packet sent by responder   00000100
                         Resend Request sent by responder          00010000
Single Packet Response   only packet                               00000000
Multi-packet Response    first packet                              10000000
                         nth (not last) packet                     00000000
                         last packet                               00100000
                         Acknowledgment packet sent by requester   00000100
                         Resend Request sent by requester          00010000

15.3.1.4 Window

The window field is valid only in a multipacket SubnAdmGetBulk() or
SubnAdmGetTable() request, and in an acknowledgment packet of a
SubnAdmConfig() transaction.

The window value in a multipacket SubnAdmGetBulk() or SubnAdmGetTable()
request specifies the number of packets the responder may send before it
receives an acknowledgment packet. When an acknowledgment packet is
received by the responder, a number of additional packets (subsequent to
the packet being acknowledged) equal to the window value may be sent.

For a SubnAdmConfig() transaction, the window value in acknowledgment
packets indicates the number of packets which the requester may send
(subsequent to the packet being acknowledged) before it receives another
acknowledgment packet. For the first burst of request packets of a
SubnAdmConfig() transaction, the default window value of 64 is used by
the requester. If the window value field in any acknowledgment packet is
less than the default window value of 64, the default window value of 64
is used by the recipient.

15.3.1.5 Admin Data

When sending a packet during a multipacket request or response, the admin
data fields of all the packets contain the full 192 bytes of data except
for the last packet. The last packet of the multipacket request or
response may contain fewer bytes as indicated by the payload length
field. (See 15.3.1.2 Payload Length on page 697.)
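
The following minimal sketch shows one way the segment number, fragment flag, and 192-byte Admin Data rules fit together. The function name `segment_payload` and the tuple layout are hypothetical; the flag values assume bit 0 is the leftmost bit of the fragment flag byte, as in the b'10000000' first-packet pattern.

```python
ADMIN_DATA_BYTES = 192      # full Admin Data field size per packet
FIRST_PACKET = 0b10000000   # fragment flag bit 0
LAST_PACKET = 0b00100000    # fragment flag bit 2


def segment_payload(payload):
    """Split `payload` into (segment_number, fragment_flag, admin_data) tuples."""
    chunks = [payload[i:i + ADMIN_DATA_BYTES]
              for i in range(0, len(payload), ADMIN_DATA_BYTES)] or [b""]
    if len(chunks) == 1:
        # Single-packet transaction: segment number 0, fragment flag 0.
        return [(0, 0, chunks[0])]
    packets = []
    for number, chunk in enumerate(chunks, start=1):
        flag = 0
        if number == 1:
            flag |= FIRST_PACKET
        if number == len(chunks):
            flag |= LAST_PACKET
        packets.append((number, flag, chunk))
    return packets


packets = segment_payload(bytes(500))
total_length = sum(len(data) for _, _, data in packets)  # first-packet payload length
valid_in_last = len(packets[-1][2])                      # last-packet payload length
```

Here `total_length` corresponds to the value carried in the payload length field of the first packet, and `valid_in_last` to the value carried in the last packet.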

15.3.2 Timeouts

The multi-packet protocol uses two timeout values which are specified in
port attributes:

• PortInfo:SubnetTimeout: This is the maximum delay time from a port to
  any other port in the subnet.

• SA ClassPortInfo:ResponseTimeValue -

   • The maximum expected time within which the SA agent of a given port
     will respond to an SA request.

   • The maximum expected time within which an SA agent of a given port
     will acknowledge the receipt of a packet after it receives the last
     packet allowed by the window parameter during a multi-packet
     transaction.

   • The maximum expected time interval between the sending of subsequent
     packets of a multipacket SubnAdmConfig() request or between the
     sending of subsequent packets of a multipacket SubnAdmGetTable(),
     SubnAdmGetBulk(), or SubnAdmConfig() response.

The above timeouts are used to calculate the following timeout periods:

15.3.2.1 Response Timeout Period

When a packet is sent which requires a response, the response timeout
period is the maximum expected time within which the sender expects to
receive the response. The response timeout period is used for the
following responses:

• For the sender of a SubnAdmGetTable() or SubnAdmGetBulk() request, the
  response timeout period is the maximum time within which the requester
  expects to receive the first response to the request.

• For the sender of a multi-packet request or multipacket response which
  contains more packets than allowed by the Window field, the response
  timeout period is the maximum time within which the sender expects to
  receive an acknowledgment packet after it sends the last packet allowed
  by the window parameter. When an acknowledgment packet is expected but
  not received, all resources associated with the transaction may be
  released.

The value of the response timeout is equal to 2*(PortInfo:SubnetTimeout
of the local port) + ClassPortInfo:ResponseTimeValue of the remote port.
15.3.2.2 Segment Timeout Period

The segment timeout period is the maximum expected time between the
receipt of subsequent packets of a multipacket SubnAdmConfig() request,
or a multipacket SubnAdmGetTable() or SubnAdmGetBulk() response.

The recipient of the multipacket request or response starts the segment
timeout period for the transaction whenever it receives a packet for the
transaction. The segment timeout period is reinitialized and restarted
whenever a packet is received as long as there are outstanding packets
expected. The KeepAlive packet also resets the segment timer, as does any
other response packet.

The value of the segment timer is equal to PortInfo:SubnetTimeout of the
multipacket recipient + ClassPortInfo:ResponseTimeValue of the
multipacket sender.
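
The two timeout formulas can be written out as a small sketch. The function names are hypothetical. The `decode_time` helper additionally assumes that both port attributes use the same 4.096 usec * 2^n encoding that the PathRecord example applies to PacketLifeTime; this section does not state the encoding, so treat that helper as an assumption.

```python
def decode_time(encoded):
    """Assumed encoding: 4.096 usec * 2**encoded, returned in seconds."""
    return 4.096e-6 * (2 ** encoded)


def response_timeout(local_subnet_timeout, remote_response_time):
    """2 * SubnetTimeout of the local port + ResponseTimeValue of the remote port."""
    return 2 * local_subnet_timeout + remote_response_time


def segment_timeout(recipient_subnet_timeout, sender_response_time):
    """SubnetTimeout of the recipient + ResponseTimeValue of the sender."""
    return recipient_subnet_timeout + sender_response_time


# Hypothetical encoded attribute values, chosen only for illustration.
subnet = decode_time(18)
response_time = decode_time(16)
print(f"response timeout: {response_timeout(subnet, response_time):.3f} s")
print(f"segment timeout:  {segment_timeout(subnet, response_time):.3f} s")
```

The response timeout counts the subnet delay twice because it covers a round trip, while the segment timeout covers only the one-way delay between consecutive packets from the same sender.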

15.3.3 Reliable Multi-packet Protocol Description

The following two sections describe the reliable multi-packet protocol.

15.3.3.1 Multi-Packet Protocol: Multi-Packet Response

Figure 161 on page 702 shows a request/response transaction in which the
response contains multiple packets. SubnAdmGetTable() and
SubnAdmGetBulk() transactions use this protocol.

In Figure 161, the requester initiates the multi-packet protocol by
sending a request with the fragment flag set to b'10000000' and the
window field set to n. When the request packet is sent, the requester
initializes the response timer for the transaction. (See 15.3.2.1
Response Timeout Period on page 700.) If the responder cannot return a
response within a time period equal to the
ClassPortInfo:ResponseTimeValue of the port on which the request was
received, it responds with a KeepAlive packet. Additional KeepAlive
packets may be sent if additional time is required to send the first
response packet.

If the requester has not received a response packet by the time the
response timer expires, a response timeout error is recognized. See
15.3.4 Error Handling on page 705.

When the responder has the response data, it sends the first n response
packets. Each packet is sent within a time period of the
ClassPortInfo:ResponseTimeValue of the time when the previous packet was
sent. Except for the optional KeepAlive packet, the segment number of
each response packet increments starting from one.

Whenever the requester receives a response, it initializes the segment
timer for the transaction. (See 15.3.2.2 Segment Timeout Period on page
701.) The segment timer is reinitialized and restarted whenever a
response packet or KeepAlive packet is received as long as there are
outstanding responses expected. The segment timer is stopped when there
are no outstanding responses expected. If the segment timer expires, a
segment error is recognized. See 15.3.4 Error Handling on page 705.

InfiniBand™ Trade Association    Page 701
Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067

InfiniBand™ Architecture Release 1.0
Volume 1 - General Specifications
Subnet Administration
October 24, 2000 FINAL

[Figure 161 Multi-Packet Protocol: Multi-packet response. Sequence
diagram between the SA Requester (SubnAdmGetTable(), SubnAdmGetBulk())
and the SA Responder (SubnAdmGetTableResp(), SubnAdmGetBulkResp()):
Request (Window=n), response packets spaced by
ClassPortInfo:ResponseTimeValue, Response Timeout, Acknowledgment
(segment=last).]

The responder continues sending packets until n packets have been sent.
Since there are additional packets to be sent for the response, the
fragment flag of the nth packet is set to b'00000000', to indicate that
the packet is not the last packet. After sending the nth packet, the
responder initializes the response timer and waits for an acknowledgment
packet.

After receiving the nth packet, the requester sends an acknowledgment
packet with a fragment flag equal to b'00000100' and a segment number
equal to n. This packet acknowledges the receipt of the first n packets
and allows the responder to release resources associated with them. Upon
receipt of this acknowledgment packet, the responder continues sending
response packets for the request beginning with segment number n+1. If no
acknowledgment packet is received when the response timer expires, the
responder may release resources associated with the transaction.

If the responder sends the last response packet before sending the next n
packets, it sets the fragment flag to b'00100000' to indicate that the
last packet has been sent. The Payload length field of the last packet is
set to the number of valid data bytes present in the Admin Data field of
the last packet.

When sending the last packet, the responder initializes the response
timer. The responder retains resources associated with unacknowledged
packets until the last packet is acknowledged or the response timer
expires because the requester may send a resend-request packet to recover
lost packets. If the responder does not receive an acknowledgment before
the response timer expires, the responder releases all resources for the
request.
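
The windowed exchange described in 15.3.3.1 can be sketched as a generator that groups response segment numbers into bursts; after each burst the responder would wait for an acknowledgment carrying the last segment number of that burst. The function name `response_bursts` is hypothetical, and timers, KeepAlive packets, and resend requests are deliberately left out.

```python
def response_bursts(total_packets, window):
    """Yield lists of segment numbers, one list per window-sized burst.

    Segment numbers start at 1; the final burst may be shorter than `window`.
    """
    if window < 1:
        raise ValueError("window must allow at least one packet")
    segment = 1
    while segment <= total_packets:
        last = min(segment + window - 1, total_packets)
        yield list(range(segment, last + 1))
        segment = last + 1


# A 10-packet response sent to a requester that specified Window = 4.
for burst in response_bursts(10, 4):
    print(f"send segments {burst}, then wait for ACK(segment={burst[-1]})")
```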



15.3.3.2 Multi-Packet Protocol: Multi-packet Request

Figure 162 on page 704 shows a request/response transaction in which the
request contains multiple packets. SubnAdmConfig() transactions use this
protocol.

In the figure, the requester initiates the transaction by sending the
initial request with a segment number of 1. The payload length field is
set to the sum of the lengths in bytes of the Admin Data fields in all
packets to be sent for the request. The requester then sends subsequent
packets within a time period of ClassPortInfo:ResponseTimeValue of the
previous packet. The segment number of each packet is incremented by one.
The requester may send 64 packets before receiving the first
acknowledgment packet.

[Figure 162 Multipacket Protocol: Multipacket Request. Sequence diagram
between the MAD Requester and the MAD Responder: Request (first packet),
Request (second packet), Request (third packet), up to Request (64th
packet), with packets spaced by ClassPortInfo:ResponseTimeValue; ACK
(Window = n); Request (next packet) through Request (last packet), up to
n packets; Segment timeout and Response Timeout intervals;
SubnAdminConfigResponse (Transaction Complete).]

As the multipacket request is received, the recipient checks to ensure
that the segment number of each packet received is one higher than the
previous packet received. If a packet is received with a segment number
which is not one higher than the previous packet, a segment error is
recognized. See 15.3.4 Error Handling on page 705. The recipient also
initializes the segment timer whenever it receives a packet. (See
15.3.2.2 Segment Timeout Period on page 701.) If the segment timer
expires, a segment error is recognized. See 15.3.4 Error Handling on page
705.

In the example shown, the requester sends 64 packets, which is the
default number of packets, and does not receive an acknowledgment packet.
When sending the 64th packet, the requester initializes the response
timeout period and waits for an acknowledgment packet.

When the responder has sufficient buffer space available, it responds
with an acknowledgment packet with the window value of n. This indicates
to the requester that it may send n additional packets (subsequent to the
packet being acknowledged). Upon receipt of this acknowledgment packet,
the requester may send up to n additional packets.

In the example shown, the requester finishes sending all of the packets
of the multi-packet request before sending n additional packets. The last
packet sent has a fragment flag of b'00100000', indicating that the
packet is the last packet of the request. When sending the last packet,
the requester initializes the response timer. The requester retains
resources associated with unacknowledged packets until the
SubnAdmConfigResp() is received or the response timer expires; this is
necessary because the responder may send a resend-request packet to
recover lost packets. If the requester does not receive a resend-request
packet or the SubnAdmConfigResp() before the response timer expires, the
requester recognizes a response timeout error.

When the responder has received all of the packets, it sends a
SubnAdminConfigResp() packet to indicate that the transaction is
complete. No other acknowledgment packet is sent.

15.3.4 Error Handling

The following errors and the associated recovery are given:

15.3.4.1 Response Timeout Error

A response timeout error is recognized when the response timer expires.
See 15.3.2.1 Response Timeout Period on page 700 for a definition of the
response timeout period.

The recovery for a response timeout error is to resend the packet for
which the response was expected, provided a vendor-specific number of
retries have not been performed. When the maximum number of retries has
been performed, resources associated with the transaction may be
released.

The maximum number of times a packet may be resent is 7.

15.3.4.2 Segment Error



A segment error is recognized when the segment timer expires or when a
midstream packet is received during a multipacket transaction whose
segment number is not one greater than the previous packet received.

The error recovery for a segment error is to 1) discard packets received
with segment numbers greater than the segment number of the missing
packet, and 2) send a resend-request packet with the fragment flag set to
b'00010000'. The segment number field of the resend request packet is set
to the segment number of the packet which was expected but not received.
This causes packet transmission to restart beginning with the packet
whose segment number is equal to the segment number of the resend-request
packet. When the resent packet with the requested segment number is
received, the recipient stops discarding packets.

If a resend-request packet is received which contains a segment number of
a packet for which resources have been released, status indicating an
invalid attribute field is returned.

The number of resend requests is vendor-specific with the limitation that
the maximum number of resend requests which may be sent for a specific
segment error is 7. When the vendor-specific number of resend requests
have been sent, resources associated with the transaction may be
released.

15.4 Operations

This section describes the operational aspects of SA.



15.4.1 Restrictions on Access



There are two types of access restrictions involved in SA: Authenticating 
the requestor of information, and restricting the data that the requestor is 
allowed to receive. These are discussed below. 

15.4.1.1 Authenticating the Requestor 

The P_Key index in a request MAD is used by SA to authenticate the 
sender of a request, along with the GID and LID of the request MAD. This 
information can then be used to determine if, for example, the sender is 
or is not a valid (possibly standby) SM. 

15.4.1.2 Access Restrictions For PathRecords 

C15-11: Subnet Administration shall return to a requester only path
records for which the source port, destination port, and requester all share
a P_Key pairwise. See the remainder of this section (15.4.1.2) for a
detailed explanation.

• Two ports share a P_Key when there is at least one valid P_Key in
one port's P_Key Table that matches a P_Key in the other port's
P_Key Table. (See 10.9.3 Partition Key Matching on page 430 for the
definition of P_Key matching, and 10.9.1.2 Special P_Keys on page
428 for the definition of a valid P_Key.)

• "Pairwise" means that P_Key sharing must be present between three
pairs of ports: path source port and path destination port; path source
port and a port of the requester; path destination port and a port of
the requester. Each of the three matches may be based on the
matching of different P_Keys.

• The path source and destination ports used to determine sharing are
the ones that are implicit in the SGID (or SLID) and DGID (or DLID) of
the path.

• The port of the requestor that is used to determine pairwise sharing
may be any port of the node from which the request came. The path
record should be returned if any port(s) of the requesting node, not
just the port from which the request MAD came, provides the pairwise
sharing described above. The requestor port sharing a P_Key with
the source port need not be the same port as the requestor port
sharing a P_Key with the destination port.

• All ports involved in this determination must be on the subnet
administered by the Subnet Administrator to which the request is directed.
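The pairwise rule can be illustrated with a short sketch. It is not the specification's algorithm: the function names are hypothetical, each P_Key Table is modeled as a plain list of integers, and the default matching and validity predicates are placeholders (exact equality, every key valid) standing in for the rules of 10.9.3 and 10.9.1.2, which are not reproduced here. The sketch also assumes all ports are already known to be on the same subnet.

```python
from typing import Callable, List, Sequence

PKeyTable = Sequence[int]


def default_match(a: int, b: int) -> bool:
    # Placeholder only: the real matching rule is defined in 10.9.3.
    return a == b


def default_valid(key: int) -> bool:
    # Placeholder only: the real validity rule is defined in 10.9.1.2.
    return True


def ports_share_pkey(
    t1: PKeyTable,
    t2: PKeyTable,
    match: Callable[[int, int], bool] = default_match,
    valid: Callable[[int], bool] = default_valid,
) -> bool:
    """True if a valid P_Key in one table matches a P_Key in the other."""
    return any(
        (valid(a) or valid(b)) and match(a, b) for a in t1 for b in t2
    )


def path_record_visible(
    source: PKeyTable,
    destination: PKeyTable,
    requester_ports: List[PKeyTable],
) -> bool:
    """Apply the three pairwise checks; any requester port may satisfy each."""
    if not ports_share_pkey(source, destination):
        return False
    shares_source = any(ports_share_pkey(source, p) for p in requester_ports)
    shares_dest = any(ports_share_pkey(destination, p) for p in requester_ports)
    return shares_source and shares_dest
```

Note that shares_source and shares_dest are evaluated independently, so one requester port may share a key with the source while a different requester port shares a key with the destination, as the fourth bullet allows.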

15.4.1.3 Access Restrictions For Other Attributes

C15-12: When a requester node requests information from the Subnet
Administrator about a subject node, the Subnet Administrator shall return
only information about subject nodes for which the requester shares a
P_Key, with exceptions noted below at Exceptions.

Sharing is defined as follows:

• Two ports share a P_Key when there is at least one valid P_Key in
one port's P_Key Table that matches a P_Key in the other port's
P_Key Table. (See 10.9.3 Partition Key Matching on page 430 for the
definition of P_Key matching, and 10.9.1.2 Special P_Keys on page
428 for the definition of a valid P_Key.)

• The port of the requestor or the subject node that is used to
determine sharing may be any port of either the requester node or the
subject node. The information should be returned if any port(s) of the
requesting node, not just the port from which the request MAD came,
provides the sharing described above.

• All ports involved in this determination must be on the subnet
administered by the Subnet Administrator to which the request is directed.

Exceptions: PortInfoRecords are always provided with the M_KEY
component set to 0, except in the case of a trusted subnet manager; in that
case the actual M_KEY component contents shall be provided. Trust of
other subnet managers is implied by earlier provision of a valid SM_KEY
by the requester during the operations leading to the establishment of
the SM master on the subnet. PortInfoRecords with complete
M_KEY information shall be openly shared between trusted SMs.
PartitionRecords shall be openly shared between trusted SMs.

15.4.2 Locating Subnet Administration

C15-13: It shall be possible to determine the location of SA by determining
the location of the SM, using the content of PortInfo:MasterSmLID,
providing the SM LID.

C15-14: By performing a SubnAdmGet(ClassPortInfo), all information
needed to communicate to Subnet Administration shall be obtained.

C15-15: SA_KEY shall be provided by the subnet administration methods
to serve as a versioning key, in which the SA_KEY value provided is
compared with a previous SA_KEY, and the SA can optionally provide only
records added since the provided key was issued.

Simple SAs will provide all records regardless of key value. This works for
SubnAdmGetBulk() and SubnAdmGetTable() methods.

15.4.4 Event Forwarding Subsystem

15.4.4.1 SubnAdmInform(), SubnAdmInformResp(), SubnAdmReport(), & SubnReportResp()

The event forwarding subsystem exists for the purpose of subscribing for
the subnet management class of traps from the SA.

C15-16: Event forwarding operations directed at SA shall conform to the
common methods as described in 13.4.5 Management Class Methods on
page 577.

15.4.5 Administration Query Subsystem

15.4.5.1 Component Mask

In the administration query subsystem the 64 bit component mask in the
SA MAD is used in query operations to specify particular attribute
components to query on. The component mask can refer to only an entire
component, not elements or parts of a component.

C15-17: In the component mask, for query operations the 0 bit must refer
to the first component, the 1 bit must refer to the second element, and so
forth.

C15-18: When a component mask bit is set to 1 in a query, the component
must be matched in all responses to a query operation. When a
component mask bit is set to 0 in a query, all records otherwise matching
must be returned regardless of the setting of that component. (There is an
exception for PathRecords; see 15.2.5.16 PathRecord on page 686.)
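As an illustration of C15-17 and C15-18, the sketch below filters records against a template under a component mask. The function names are hypothetical, each record is modeled as a sequence of component values in attribute order, bit 0 is taken to be the least significant bit of the mask, and the PathRecord exception is not modeled.

```python
from typing import List, Sequence


def record_matches(
    template: Sequence[int], record: Sequence[int], component_mask: int
) -> bool:
    """Compare only the components whose mask bit is 1."""
    for index, (wanted, actual) in enumerate(zip(template, record)):
        bit_is_set = (component_mask >> index) & 1
        if bit_is_set and wanted != actual:
            return False
    return True


def query(
    records: List[Sequence[int]], template: Sequence[int], component_mask: int
) -> List[Sequence[int]]:
    """Return every record that matches the template on the masked components."""
    return [r for r in records if record_matches(template, r, component_mask)]
```

With a mask of 0b100 only the third component is compared, so the values placed in the first two template positions have no effect on the result.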



C15-19: For edit operations the component mask shall be used to refer to
a specific component to be edited. Such editing is only valid using the
SubnAdmConfig() method. As with query operations, the bits all map to
specific components to be edited, the 0 bit refers to the first component,
the 1 bit refers to the second component, and so on. If the component
mask bit for a component is set to 0, that component must be ignored in
the edit operation: no change is made to that particular attribute
component.

A component mask of all ones means that all components are to be used
for a query. For an edit, a component mask of all ones means that all
components are to be edited. The result of using a component mask of 0 for a
query or an edit operation is undefined.

For the event forwarding subsystem the component mask is unused.

15.4.5.2 Attribute and Attribute Modifier Use

Query and editing of tables (groups of RAs) is done with the use of the
Attribute ID and Attribute Modifier fields.

The Attribute ID is used to reference the table to query or edit. In bulk
operations the Attribute ID is unused, and set to 0.

The Attribute Modifier and End RID are used to indicate ranges of RAs
based on a RA RID (not the related RID).

See 15.2.5 Attributes on page 678 for more information on RAs.

15.4.6 SubnAdmGetTable() & SubnAdmGetTableResp()

SubnAdmGetTable() is used to request an RA table. Operations are
allowed on a specific table only, specified by attribute identifier.

15.4.6.1 Query by Template

C15-20: The Attribute modifier of 0xffffffff shall indicate a query by
template rather than by RID.

Consequently, an attribute modifier of 0xffffffff cannot be specified for
query of RID values.

A query by template uses the RA specified by the Attribute identifier (also
called the Attribute ID) to determine the query format contained in the SA
MAD data area of the request. Such requests for an exact match of a
single component, such as a ServiceRecord:ServiceName, are indicated
with that value being set to the value to be searched for, the component
mask being set to indicate the ServiceName component as the attribute
component to be searched for, and all other values in the template are
ignored.

Figure 163 Search by NodeRecord Template to search for all routers
[Figure: MAD header with GetTable(), Attribute = NodeRecord, Attribute
Modifier = 0xFFFFFFFF, EndRID = 0, ComponentMask = bit 2 set to 1, all
else 0; followed by a Node RA template that is all 0 except NodeType = 3.]

• Attribute Modifier set to 0xffffffff - If the attribute modifier is set to
0xffffffff, this is interpreted as a query indicated by the RA(s)
following in the data area.

• Component set to exact value or values to match

• All other component values are ignored, and may be set to 0.

Deleted RAs are indicated with the major RID intact, but all values
set to 0 in that RA.

A query by template for NodeRecords with a NodeGUID of 3 would
supply the data shown in Figure 163 on page 710 and Table 177 on page
710 in the request:

Table 177 SubnAdmGetTable query for all NodeRecords with a specific NodeGUID

Component | Value | Interpretation
Header:AttributeID | 0x0011 | Specify NodeRecords.
Header:Attribute Modifier | 0xffffffff | Query specified in attribute header.
Header:EndRID | 0x0000 | Value is ignored by SA.
Header:ComponentMask | bit 4 set to 1, all others 0 | Set bit 4 to one, all else to 0
NodeRecord:NodeGUID | 3 | Obtain all NodeRecords with NodeGUID value of 3.
All other NodeRecord components | 0 | Ignore this field.


15.4.6.2 Query by RID Range

C15-21: A SubnAdmGetTable() with an attribute modifier not equal to
0xffffffff shall indicate a query by RID range.

C15-22: A query by RID range shall use the attribute modifier as the start
of the RID range, and the EndRID component as the end of the range
returned, inclusive.

The attribute value indicates the type of record returned in a query by RID
Range. (For example, for the NodeRecord RA, the Attribute ID value is
0x0011).

The two following tables specify an example of a query, and a range
request:

Table 178 SubnAdmGetTable query for all NodeRecords within a given range

Component | Value | Interpretation
Header:AttributeID | 0x0011 | Specify NodeRecords.
Header:Attribute Modifier | 0x003000 | Get records starting at NodeRID 0x30 or greater
Header:EndRID | 0x004000 | Get records ending at NodeRID 0x40 or below.
Header:ComponentMask | 0x0 | Data area is set to 0, ignored by SA
All NodeInfo components | 0 | Data area is all set to 0, ignored by SA
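The distinction drawn by C15-20 through C15-22 can be sketched as follows. The names classify_get_table and select_by_rid_range are hypothetical, and the table is modeled as a dictionary keyed by integer RID, which is an assumption made only for this illustration.

```python
from typing import Dict, List, Optional, Tuple

# Attribute modifier value that marks a query by template (C15-20).
TEMPLATE_MODIFIER = 0xFFFFFFFF


def classify_get_table(
    attribute_modifier: int, end_rid: int
) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Return the query kind and, for a range query, its inclusive bounds."""
    if attribute_modifier == TEMPLATE_MODIFIER:
        return ("template", None)
    return ("rid_range", (attribute_modifier, end_rid))


def select_by_rid_range(
    records_by_rid: Dict[int, str], start: int, end: int
) -> List[str]:
    """Return records whose RID lies in [start, end], in ascending RID order."""
    return [
        record
        for rid, record in sorted(records_by_rid.items())
        if start <= rid <= end
    ]
```

Because the end of the range is inclusive, a request with bounds 0x30 and 0x40 returns the records at both 0x30 and 0x40 if they exist.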



15.4.6.3 Requesting all records of a table

To request all records of a table, the Attribute ID is set for the table desired,
the SA_KEY is set to 0, and the Attribute modifier is set to 0 and EndRID
value is set to 0xffffffff.

15.4.6.4 Requesting all new table records since last request

C15-23: A SubnAdmGetTable() with an SA_KEY of 0 shall return the
entire table specified by the request, with the current value of the SA_Key.

C15-24: A SubnAdmGetTable() with a non-zero SA_Key that was current
at a prior time shall return either the changes made to the table specified
by the request since the provided SA_Key was current; or the entire table
specified by the request; with the current value of the SA_Key.

In this way the SA_KEY can be used to limit the records returned to only
the records added since last query.

C15-25: A SubnAdmGetTable() with a non-zero SA_Key that has never
been current will return an empty response (no records) with a status field
indicating invalid attribute.
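The SA_Key behavior of C15-23 through C15-25 can be sketched with a toy table. Everything about the key itself is an assumption made for illustration: the class VersionedTable is hypothetical, the key is modeled as a counter that starts at 1 and increases by one per added record, and the key returned alongside an invalid request is left as None because the text does not say what it should be.

```python
from typing import List, Optional, Tuple


class VersionedTable:
    """Toy SA table with a counter-based SA_Key (an assumption of this sketch)."""

    def __init__(self) -> None:
        self._records: List[Tuple[int, str]] = []  # (key when added, record)
        self._current_key = 1

    def add(self, record: str) -> None:
        self._current_key += 1
        self._records.append((self._current_key, record))

    def get_table(self, sa_key: int) -> Tuple[str, List[str], Optional[int]]:
        """Return (status, records, current SA_Key)."""
        if sa_key == 0:
            # C15-23: the entire table plus the current key.
            return "ok", [r for _, r in self._records], self._current_key
        if not 1 <= sa_key <= self._current_key:
            # C15-25: a key that has never been current.
            return "invalid attribute", [], None
        # C15-24: this sketch returns only the changes; returning the
        # entire table would also be permitted.
        newer = [r for k, r in self._records if k > sa_key]
        return "ok", newer, self._current_key
```

A requester would store the key returned by one call and present it on the next call to receive only the records added in between.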



15.4.7 SubnAdmGetTableResp()

Subnet Administration uses the SubnAdmGetTableResp() to respond to
all SubnAdmGetTable() queries.

o15-7: SA may indicate a refused request by returning a
SubnAdmGetTableResp() with the status field providing the reason for
refusal.

A SubnAdmGetTable() and the corresponding SubnAdmGetTableResp()
are illustrated in Figure 164 on page 713. Subsequent
SubnAdmGetTableResp() MADs of this transaction will have the
FragmentFlag and Segment Number values set to indicate place in the data
stream. The continuation of the data is reassembled by the requester in
its own data area.

The EndRID value for SubnAdmGetTableResp() is undefined.

The RAs for a SubnAdmGetTableResp() for the example of Figure 164 on
page 713 are contained within a SA MAD as shown. Note that records
may be broken across successive response MADs.

Figure 164 Example GetTableResponse() Layout
[Figure: a MAD header carrying GetTableResp(), shown for a
SubnAdmGetTableResp() MAD returning a single PortInfoRecord and for a
SubnAdmGetTableResp() MAD returning multiple RouterRecords.]

15.4.8 SubnAdmGetBulk() & SubnAdmGetBulkResp() - Bulk Table Retrieval

o15-8: If implemented, SubnAdmGetBulk() shall return all records
currently held by SA.

SubnAdmGetBulk() has no implied attribute; the data payload is all set to
zeroes, and is ignored on receive. The Attribute Modifier is ignored by the
receiver.

o15-9: SubnAdmGetBulkResp() MADs shall have a data area consisting
of the SAResponse Attribute, followed by the actual record tables.

o15-10: If SA has no data, SubnAdmGetBulkResp() shall return with the
Attribute Modifier field set to all zeros, and the data area set to all
zeros.

o15-11: SubnAdmGetBulk() shall use the SA_Key with the same
semantics as SubnAdmGetTable().

An example of a SubnAdmGetBulkResp() is illustrated in Figure 165 on
page 714.

15.4.9 SubnAdmConfig() & SubnAdmConfigResp() - Add, modify or delete RAs

SubnAdmConfig() provides the ability to add, delete, or modify entire SA
tables, in contrast to SubnAdmSet(), which only operates on single
records. The RAs supplied with the SubnAdmConfig() are the RAs to be
operated on, subject to component mask settings. The RIDs in those RAs
indicate the RAs to be operated on.

Figure 165 Example GetBulkResp() Layout
[Figure: a MAD header carrying GetBulkResp() with Attribute = 0, Attribute
Modifier = 45 and EndRID = 0, followed by the SAResponse attribute and a
series of PortInfoRecords that continues across further MADs; labeled
"SubnAdmGetBulkResp() MAD returning multiple Attribute Records".]

o15-12: State records (see Table 149 Subnet Administration Attributes
(Summary) on page 679) shall only be modified by SubnAdmConfig()s
sent from the master subnet manager.

Hence the target of a SubnAdmConfig() can only be a non-master subnet
manager. A use for this is for a master SA to "push" data to a standby.

o15-13: If the number of records supplied in a SubnAdmConfig() is fewer
than the number to fill the RID range specified by the attribute modifier and
the EndRID, the final record sent is repeated until the end of the range.

Note that the previous paragraph allows editing the attributes of a range
of records by sending a single record, since the component mask can be
set to modify only the specific desired components.
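The repetition rule of o15-13 can be sketched as below. The function name expand_config_records is hypothetical. The sketch assumes that RIDs in the range are consecutive integers and that the range is inclusive at both ends, as it is for queries; it also drops any records beyond the end of the range, which is a choice of this illustration and not something the text specifies.

```python
from typing import Dict, List


def expand_config_records(
    records: List[str], start_rid: int, end_rid: int
) -> Dict[int, str]:
    """Map each RID in [start_rid, end_rid] to a record, repeating the last."""
    if not records:
        raise ValueError("at least one record is required")
    count = end_rid - start_rid + 1
    if count < 1:
        raise ValueError("end_rid must not be below start_rid")
    padded = list(records[:count])
    # Repeat the final record supplied until the range is filled.
    padded += [records[-1]] * (count - len(padded))
    return {start_rid + offset: rec for offset, rec in enumerate(padded)}
```

Sending a single record for a range of three RIDs therefore applies that one record to all three, which is the single-record range edit the note describes.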
15.4.9.1 SubnAdmConfigResp()

o15-14: SubnAdmConfigResp() shall contain an admin data field set to all
0s, and a status field indicating the success or failure of the corresponding
SubnAdmConfig().

15.4.10 SubnAdmGet() & SubnAdmGetResp(): Get an RA

C15-26: In response to a SubnAdmGet() with a non-zero RID contained
in the attribute modifier, SubnAdmGetResp() shall return the RA with that
RID, subject to the access rules specified in 15.4.1 Restrictions on Access
on page 706.

C15-27: In response to a SubnAdmGet() with a zero in the attribute
modifier, SubnAdmGetResp() shall return the RA matching the components
supplied and the component mask, subject to the access rules specified
in 15.4.1 Restrictions on Access on page 706.

C15-28: The SA_Key value provided in a SubnAdmGet() must be
ignored, and the SA_Key returned in a SubnAdmGetResp() shall be zero.

o15-15: If more than one RA would be returned as a result of matching,
SubnAdmGetResp() shall return a status of ERR_REQ_INVALID.

Table 179 SubnAdmGet query for a NodeRecord on page 715 shows an
example SubnAdmGet() query for a NodeRecord.

Table 179 SubnAdmGet query for a NodeRecord

Component | Value | Interpretation
Header:AttributeID | 0x0011 | Specify NodeInfo records.
Header:Attribute Modifier | 0xffffffff | Signifies query using data matching in supplied attribute.
Header:EndRID | 0x0000 | Set to 0, ignored by SA.
NodeInfo:PortGUID | 0x0004 | Obtain NodeRecord with PortInfoGID value of 4.
All other NodeInfo components | 0x0000 | All set to 0, SA will ignore these fields.



15.4.11 SubnAdmSet(): Set an RA

C15-29: In response to a SubnAdmSet() with an edit modifier of add and
a zero RID in the attribute modifier, the RA contained in that MAD will be
added to the SA, and a SubnAdmGetResp() shall be returned containing
the RA provided along with the RID of that RA, subject to the access rules
specified in 15.4.1 Restrictions on Access on page 706.

C15-30: A SubnAdmSet() with an edit modifier of edit shall not be
performed.

C15-31: If a SubnAdmSet() with an edit modifier of delete is received by
the SA containing the entire attribute record to delete, with all components
matching, and the attribute modifier is set to the value of the major RID,
then that RA shall be deleted and SubnAdmGetResp() is returned with a
zero status value, provided that the requestor is allowed access to that
record according to the access rules specified in 15.4.1 Restrictions on
Access on page 706.

o15-16: If SubnAdmSet() with an edit modifier of delete is in any way
ambiguous to the SA, or access to it would violate access rules, a null
response must be returned to the requester with a status of
ERR_REQ_INVALID.

Chapter 16: General Services

This chapter describes the range of management services that the IBA
provides under general services, except for the Subnet Administration
which is described in the previous chapter. General management services
provide the following management classes:

• Performance Management - provides methods that enable a manager
to retrieve performance statistics and error information from IBA
components.

• Baseboard Management - provides a means to transport messages
to components beyond the subnet, to "out of band" components. An
example might be chassis temperature monitoring and control
hardware on an IBA channel adapter.

• Device Management - provides the means to perform I/O controller /
I/O unit management. This class defines the mechanisms to send
and receive device management packets between two subnet-attached
points, typically between an HCA and a TCA. The TCA provides
an interface to the I/O controller and I/O device.

• SNMP Tunneling - provides a set of methods, data formats and
attributes to support SNMP tunneling. The SNMP packet is embedded
in the IBA-compliant management datagram.

• Vendor Specific - provides a set of general purpose methods. Vendors
are free to define new methods and attributes, however they
conform to management datagram formats and restrictions described
herein.

• Application Specific - provides a set of general purpose methods.
Applications are free to define new methods and attributes, however
they conform to management datagram formats and restrictions
described herein.

• Communication Management - provides the mechanisms to establish,
terminate, and migrate connections between nodes, and provides
basic service ID resolution.

34 

16.1 Performance Management

o16-1: The Performance Management Agent is mandatory on all nodes.

The Performance Management class provides mechanisms to enable a
performance management entity to retrieve performance and error
statistics from InfiniBand components. Performance quantities are divided
into two classes:

• Mandatory for all ports of all nodes (TCAs, HCAs, Switches, and
Routers). These quantities are deemed necessary to support
fundamental instrumentation and performance analysis of a multi-vendor
InfiniBand fabric.

• Optional. These quantities may be implemented at the vendor's
discretion, and are described here as an aid to standardization.

16.1.1 MAD Format

C16-2: The datagrams in the Performance class shall conform to the MAD
format and use as specified in 13.4 Management Datagrams on page 573
and further customized in Figure 166 Performance Management MAD
Format on page 718 and Table 180 Performance Management MAD
Fields on page 718 below.

Figure 166 Performance Management MAD Format

bytes 0-23: Common MAD Header
bytes 24-63: Reserved
bytes 64-255: Data

Table 180 Performance Management MAD Fields

Field Name | Length | Description
Common MAD Header | 24 bytes | Common MAD Header as described in 13.4.2 Management Datagram Format on page 574
Reserved | 40 bytes | This field is reserved and shall be set to zeroes.
Data | 192 bytes | Attribute data is mapped bit for bit from the format described in the following sections to the start of this data field. If the attribute is smaller than the data field, the content of the remainder of the data field is unspecified.
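The field lengths in Table 180 add up to a 256-byte datagram, which the following sketch assembles and splits. The function names are hypothetical, the Common MAD Header is treated as an opaque 24-byte string, and zero-filling the unused tail of the Data field is a choice of this sketch, since the text leaves that content unspecified.

```python
from typing import Tuple

HEADER_LEN = 24     # Common MAD Header
RESERVED_LEN = 40   # Reserved, set to zeroes
DATA_LEN = 192      # Data
MAD_LEN = HEADER_LEN + RESERVED_LEN + DATA_LEN  # 256 bytes


def build_performance_mad(common_header: bytes, attribute_data: bytes) -> bytes:
    """Lay out header, zeroed reserved field and attribute data."""
    if len(common_header) != HEADER_LEN:
        raise ValueError("common MAD header must be 24 bytes")
    if len(attribute_data) > DATA_LEN:
        raise ValueError("attribute data must fit in 192 bytes")
    # Attribute data starts at the beginning of the Data field.
    data = attribute_data.ljust(DATA_LEN, b"\x00")
    return common_header + bytes(RESERVED_LEN) + data


def split_performance_mad(mad: bytes) -> Tuple[bytes, bytes, bytes]:
    """Return (common header, reserved, data) from a 256-byte MAD."""
    if len(mad) != MAD_LEN:
        raise ValueError("a Performance Management MAD is 256 bytes")
    reserved_end = HEADER_LEN + RESERVED_LEN
    return mad[:HEADER_LEN], mad[HEADER_LEN:reserved_end], mad[reserved_end:]
```

The Data field therefore begins at byte offset 64 regardless of which attribute is carried.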
16.1.1.1 Status Field

The Status field is described in 13.4.7 Status Field on page 587. No
class-specific bits are defined.

Table 181 Performance Management Status Field

Bits | Name | Meaning
0-7 | | Common bits as defined in 13.4.7 Status Field on page 587
8-15 | | Class-specific bits are reserved

16.1.2 Methods

The Performance Management class uses a subset of the common
methods described above in section 13.4.5 Management Class Methods
on page 577:

Table 182 Performance Management Methods

Method Type | Value | Description
PerformanceGet() | 0x01 | Request a get (read) of a class specific information attribute.
PerformanceSet() | 0x02 | Request a set (write) of a class specific information attribute.
PerformanceGetResp() | 0x81 | Response from a get or set request.



16.1.3 Mandatory Attributes



Performance Management defines the mandatory Attributes and Attribute 
Modifiers use summarized in Table 183 Mandatory Performance Manage- 
ment Attributes on page 719 . They are described in detail in the sections 
following the table. The use model for these attributes is described in 
16.1.3.6 Mandatory Performance Attribute Use Model on page 734 . 

Table 183 Mandatory Performance Management Attributes

Attribute Name | Attribute ID | Attribute Modifier | Description | Length
ClassPortInfo | 0x0001 | 0x00000000 | See 13.4.8.1 ClassPortInfo on page 589 |
PortSamplesControl | 0x0010 | Selects one of n independent sampling mechanisms; zero (0) must be implemented. | Port Performance Data Sampling Control | 68 bytes
PortSamplesResult | 0x0011 | Selects one of n independent sampling mechanisms; zero (0) must be implemented. | Port Performance Data Sampling Results | 64 bytes
PortCounters | 0x0012 | 0x00000000 | Port Basic Performance & Error Counters | 40 bytes
Reserved | 0x0013-0x0014 | 0x00000000-0xFFFFFFFF | Reserved |

Table 184 Mandatory Performance Management Attribute / Method Map

Attribute Name | PerformanceGet | PerformanceSet
ClassPortInfo | X | X
PortSamplesControl | X | X
PortSamplesResult | X | X
PortCounters | X | X

16.1.3.1 ClassPortInfo

The ClassPortInfo attribute is described in 13.4.8.1 ClassPortInfo on page
589. No class-specific bits are defined.

Table 185 Performance Management ClassPortInfo:CapabilityMask

Bits | Name | Meaning
0-7 | | Common bits as defined in 13.4.8.1 ClassPortInfo on page 589
8 | AllPortSelect | If reported as 1, indicates that all attributes containing the PortSelect component support setting it to 0xFF to gather data from all ports at once. If reported as 0, using 0xFF in PortSelect results in undefined behavior.
9-15 | | Class-specific bits are reserved

16.1.3.2 PortSamplesControl 



The PortSamplesControl attribute is mandatory. It provides a means of ini- 
tiating a sample and selecting, for one selected port during the specified 
interval, quantities to be sampled such as: 

• The amount of data sent and received 

• The number of packets sent and received 

• The transmit queue depth at the start of the interval 

The complete list of quantities that can be sampled using this mechanism 
is given in Table 187 SampleSelect Values . 

Sampling is initiated by means of a PerformanceSet(PortSamplesCon- 
trol). Sampling status and results are obtained by means of a Perfor- 
manceGet(PortSamplesResult). 

To support random sampling that is decoupled from MAD latencies and
other port activities at either the sender or receiver, the
PortSamplesControl attribute provides a means to specify a delayed start
time for the sample interval. See the SampleStart component in Table 186
PortSamplesControl.

Performance sampling operations are based on a standard time interval
called a tick. A tick is a multiple of the link transfer period: for example,
a multiple of 400 picoseconds for a link running at 2.5 giga-transfers per
second. Implementers are given a range of multipliers to choose from.
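The tick arithmetic can be made concrete with a small sketch based on the Tick encoding listed in Table 186 (0x00 = 10 x, 0x01 = 20 x, 0x02 = 30 x, 0xFF = 2,560 x the link transfer period). The function names are hypothetical, and the linear rule used for the values between 0x02 and 0xFF is inferred from those four listed points, not stated outright.

```python
def tick_multiplier(tick: int) -> int:
    """Multiplier of the link transfer period for an 8-bit Tick value."""
    if not 0 <= tick <= 0xFF:
        raise ValueError("Tick is an 8-bit component")
    # Inferred linear encoding: 0x00 -> 10, 0x01 -> 20, ..., 0xFF -> 2560.
    return (tick + 1) * 10


def tick_interval_ps(tick: int, link_transfer_period_ps: int = 400) -> int:
    """Tick length in picoseconds; 400 ps is the 2.5 Gtransfer period."""
    return tick_multiplier(tick) * link_transfer_period_ps
```

For a 2.5 Gtransfer link, a Tick value of 0x00 gives 4,000 picoseconds, which agrees with the 4 nanoseconds quoted in the table.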

The Attribute Modifier selects one of several possible independent sam- 
pling mechanisms. 

C16-3: All nodes shall implement PortSamplesControl and
PortSamplesResult corresponding to an Attribute Modifier of zero.
Implementation of additional sets of PortSamplesControl and
PortSamplesResult permits simultaneous sampling of multiple ports, and
shall use ascending Attribute Modifier values starting with one (1). The
number of additional sets implemented is defined in
PortSamplesControl.SampleMechanisms. A special value of Attribute
Modifier (0xFFFFFFFF) selects all mechanisms at once.

C16-4: For each sampling mechanism, at least one and up to 15
counters shall be implemented.
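Which of the up to 15 counters exist, and how wide they are, is reported through the CounterWidth and CounterMasks components of Table 186. The sketch below decodes them under stated assumptions: the function names are hypothetical, the packed mask field is passed in as a plain integer, and "Bit 0" of each 3-bit mask is taken to be its least significant bit.

```python
from typing import Dict, List

# CounterWidth encoding from Table 186; codes 5 to 7 are reserved.
_COUNTER_WIDTHS = {0: 16, 1: 20, 2: 24, 3: 28, 4: 32}


def counter_width_bits(code: int) -> int:
    """Translate a CounterWidth code into a width in bits."""
    if code not in _COUNTER_WIDTHS:
        raise ValueError("reserved or invalid CounterWidth code")
    return _COUNTER_WIDTHS[code]


def unpack_counter_masks(packed: int, count: int) -> List[int]:
    """Split a packed field into 3-bit masks, most significant field first."""
    return [(packed >> (3 * (count - 1 - i))) & 0b111 for i in range(count)]


def describe_mask(mask: int) -> Dict[str, bool]:
    """Interpret one 3-bit capability mask."""
    return {
        "implemented": mask != 0,
        "mandatory": bool(mask & 0b001),
        "optional": bool(mask & 0b010),
        "vendor": bool(mask & 0b100),
    }
```

For CounterMasks1to9 the call would be unpack_counter_masks(value, 9), so that the first list entry describes Counter1 and the last describes Counter9; CounterMasks10to14 would use a count of 5.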

Table 186 PortSamplesControl 



Component 


Access 


Length 
(bits) 


Description 


Opcode 


RW 


8 


Used to select a specific packet op code (as found In BTH) when sampling 
optional quantities that are op code specific. If OpCode is OxFF, all op codes 
are sampled as one otherwise only one op code can be sampled at a time, 
although multiple quantities can be sampled for the same op code. 


PortSelect 


RW 


8 


Selects which port will be sampled. For an HCA or TCA, PortSelect refers to 
an end port. For a switch, PortSelect refers to a switch port. If the value does 
not correspond to an actual port, the sample timers run normally but the result- 
ing sample counter values are zero. 

If gatherinq data from all oorts at once is suDDorted (see Table 185 Perfor- 
mance Manaaement ClassPortlnfo:CapabilitvMask on oaae 720). setting Port- 
Select to OxFF will cause samples from alt ports to be accumulated. 


Tick 


RO 


8 


Indicates the node's sampling clock interval as a multiple of 10x the link trans- 
fer period. For a 2.5 Gtransfer link, the transfer period is 400 picoseconds. The 
encoding is: 

0x00 = 10 X link transfer period (4 nanoseconds for a 2.5 Gtransfer link) 
0x01 = 20 X link transfer period 
0x02 = 30 X link transfer period 

OxFF = 2,560 X link transfer period 

To maximize utility of the perfomnance attributes, Implementers are encour- 
aged to choose the smallest practical tick size 


Reserved 


RO 


5 


Reserved, shall be zero. 
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Table 186 PortSamplesControl 



1 
2 
3 
4 
5 
6 
7 
8 
9 

10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
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21 
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23 
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25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
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40 
41 
42 



Component 



Access 



Length 
(bits) 



Description 



CounterWidth 



RO 



Indicates the actual width in bits of the following components: 

- SampleStart 

- Samplelnterval 

- PortSamplesResult:CounterO to 14 
The encoding is: 

0 = 16 bits 

1 = 20 bits 

2 = 24 bits 

3 = 28 bits 

4 = 32 bits 

5 - 7 = reserved 

Counters smaller than 32 bits shall be implemented as the least significant bits 
of the coH'esponding 32-bit attribute component, with the unimplemented 
upper bits of the component returning zeroes for Get and ignored for Set. 



Reserved 



RO 



Reserved, shall be zero. 



CounterOMask 



RO 



A bitmask that determines the capabilities of PortSamptesResultiCounterO. 

Bit 0 = supports all mandatory quantities; shall be 1 

Bit 1 = supports optional quantities 

Bit 2 = supports vendor-defined quantities 



CounterMasks1to9 



RO 27 An array of nine 3-bit bitmasks. each of which determines the capabilities of an 

optional counter in PortSamplesResult. The most significant 3-bit field corre- 
sponds to PortSamplesResult:Counter1 ; the least significant field corresponds 
to PortSamplesResult:Counter9 
Encoding: 

Bit 0 = supports all mandatory quantities 

Bit 1 = supports optional quantities 

Bit 2 = supports vendor-defined quantities 

All bits zero means the counter is not implemented 



Reserved 



RO 



Reserved, shall be zero. 



CounterMasks10to14 



RO

15

An array of five 3-bit bitmasks, each of which determines the capabilities of an
optional counter in PortSamplesResult. The most significant 3-bit field corre-
sponds to PortSamplesResult:Counter10; the least significant field corre-
sponds to PortSamplesResult:Counter14.
Encoding:

Bit 0 = supports all mandatory quantities 

Bit 1 = supports optional quantities 

Bit 2 = supports vendor-defined quantities 

All bits zero means the counter is not implemented 
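
As an illustration of the CounterWidth and counter-mask encodings described above, the following sketch decodes them in software. All function names are hypothetical, and it assumes that "Bit 0" of each 3-bit mask is the least significant bit of that field.

```python
# CounterWidth encoding from the table above; codes 5-7 are reserved.
COUNTER_WIDTH_BITS = {0: 16, 1: 20, 2: 24, 3: 28, 4: 32}


def counter_width_bits(code: int) -> int:
    """Translate a CounterWidth code into a width in bits."""
    if code not in COUNTER_WIDTH_BITS:
        raise ValueError("reserved CounterWidth code")
    return COUNTER_WIDTH_BITS[code]


def valid_counter_value(raw32: int, width_code: int) -> int:
    """Keep only the implemented (least significant) bits of a 32-bit counter."""
    return raw32 & ((1 << counter_width_bits(width_code)) - 1)


def unpack_masks(packed: int, count: int) -> list:
    """Split a packed array of 3-bit masks, most significant field first."""
    return [(packed >> (3 * (count - 1 - i))) & 0b111 for i in range(count)]


def describe_mask(mask: int) -> dict:
    """Name the capability bits of one 3-bit mask (bit 0 assumed to be the LSB)."""
    return {
        "implemented": mask != 0,
        "mandatory": bool(mask & 0b001),
        "optional": bool(mask & 0b010),
        "vendor": bool(mask & 0b100),
    }
```

For CounterMasks1to9 the call would be unpack_masks(value, 9), so that the first list element describes Counter1 and the last describes Counter9; for CounterMasks10to14 the count is 5.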
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Table 186 PortSamplesControl 






Component 



Access



Length
(bits)



Description 



SampleMechanisms 



RO 



The number of independent sample mechanisms implemented (i.e. sets of
PortSamplesControl and PortSamplesResult), minus one:

0 = one sample mechanism is available (addressed via Attribute Modifier zero)

1 = two sample mechanisms are available, Attribute Modifiers 0 and 1

255 = 256 sample mechanisms are available, addressed via Attribute Modifi-
ers 0 through 255

Providing multiple sampling mechanisms is optional. N sample mechanisms
would permit N independent samples to be run simultaneously. A special value
of the Attribute Modifier (0xFFFFFFFF) allows all sample mechanisms to be
started with a single Set, sampling the same quantities during the same inter-
val on N ports.



Reserved 



RO 



Reserved, shall be zero. 



SampleStatus 



RO 



Indicates the status of sampling: 

0 = sampling is complete and the results are available from the PortSamples- 
Result attribute 

1 = the SampleStart timer is running. All sample counter values in PortSam- 
plesResult are undefined 

2 = sampling is underway. All sample counter values in PortSamplesResult are 
undefined 

3 = reserved 

While SampleStatus is non-zero, a PerformanceSet (PortSamplesControl) will 
not affect PortSamplesControl and will return the existing values of all compo- 
nents 
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Table 186 PortSamplesControl 






Component 



Access



Length
(bits)



Description 



OptionMask 



RO

64

A bit mask indicating which optional InfiniBand performance quantities are
implemented, if any. See Table 187 SampleSelect Values for a description of
each quantity or set of quantities:

Bit 0 (LSB) = reserved shall be zero. 

Bit 1 = PortXmitQueue[n] 

Bit 2 = PortXmitDataVL[n] 

Bit 3 = PortRcvDataVL[n] 

Bit 4 = PortXmitPktVL[n] 

Bit 5 = PortRcvPktVL[n] 

Bit 6 = PortRcvErrorDetails:PortLocalPhysicalErrors 

Bit 7 = PortRcvErrorDetails:PortMalformedPacketErrors 

Bit 8 = PortRcvErrorDetails: PortBufferOverrunErrors 

Bit 9 = PortRcvErrorDetails: PortDLIDMappingErrors 

Bit 10 = PortRcvErrorDetails: PortVLMappingErrors 

Bit 11 = PortRcvErrorDetails: PortLoopingErrors 

Bit 12 = PortXmitDiscardDetails: PortInactiveDiscards

Bit 13 = PortXmitDiscardDetails: PortNeighborMTUDiscards

Bit 14 = PortXmitDiscardDetails: PortSwLifetimeLimitDiscards 

Bit 15 = PortXmitDiscardDetails: PortSwHOQLifetimeLimitDiscards 

Bit 16 = PortOpRcvCounters: PortOpRcvPkts 

Bit 17 = PortOpRcvCounters: PortOpRcvData 

Bit 18 = PortFlowCtlCounters: PortXmitFlowPkts 

Bit 19 = PortFlowCtlCounters: PortRcvFlowPkts 

Bit 20 = PortVLOpPackets: PortVLOpPackets[n] 

Bit 21 = PortVLOpData: PortVLOpData[n] 

Bit 22 = PortVLXmitFlowCtlUpdateErrors: PortVLXmitFlowCtlUpdateErrors[n] 

Bit 23 = PortVLXmitWaitCounters: PortVLXmitWait[n] 

Bits 24 - 47 Reserved shall be zero. 

Bit 48 = SwPortVLUnkDests: PortVLUnkDests[n] 

Bits 49 -63 Reserved shall be zero. 

Performance quantities that are counted per VL are limited to the actual num- 
ber of VLs implemented. The result of selecting an unimplemented quantity is 
all zeroes. 



VendorMask 



RO

64

A bitmask indicating which vendor-specific counters are implemented. Must be
zero if the node does not support any vendor-specific counters; otherwise use
is vendor-defined.
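
A performance application might translate the OptionMask bit assignments listed above into names roughly as follows. This is a sketch only; the dictionary and function names are hypothetical, and only the component names after the colon are kept for brevity.

```python
# Bit positions copied from the OptionMask description above (bit 0 is the LSB).
OPTION_MASK_BITS = {
    1: "PortXmitQueue[n]", 2: "PortXmitDataVL[n]", 3: "PortRcvDataVL[n]",
    4: "PortXmitPktVL[n]", 5: "PortRcvPktVL[n]",
    6: "PortLocalPhysicalErrors", 7: "PortMalformedPacketErrors",
    8: "PortBufferOverrunErrors", 9: "PortDLIDMappingErrors",
    10: "PortVLMappingErrors", 11: "PortLoopingErrors",
    12: "PortInactiveDiscards", 13: "PortNeighborMTUDiscards",
    14: "PortSwLifetimeLimitDiscards", 15: "PortSwHOQLifetimeLimitDiscards",
    16: "PortOpRcvPkts", 17: "PortOpRcvData",
    18: "PortXmitFlowPkts", 19: "PortRcvFlowPkts",
    20: "PortVLOpPackets[n]", 21: "PortVLOpData[n]",
    22: "PortVLXmitFlowCtlUpdateErrors[n]", 23: "PortVLXmitWait[n]",
    48: "PortVLUnkDests[n]",
}


def implemented_optional_quantities(option_mask: int) -> list:
    """List the optional quantities whose OptionMask bit is set, lowest bit first."""
    return [name for bit, name in sorted(OPTION_MASK_BITS.items())
            if (option_mask >> bit) & 1]


def reserved_bits_set(option_mask: int) -> bool:
    """Report whether any bit outside the defined set is non-zero in the 64-bit mask."""
    defined = sum(1 << bit for bit in OPTION_MASK_BITS)
    return bool(option_mask & ~defined & ((1 << 64) - 1))
```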
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Table 186 PortSamplesControl 



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Component 



Access



Length
(bits)



Description 



SampleStart 



RW

32

Determines when the sampling interval starts. When Set, this value is loaded
into a timer and the following events occur:

- SampleStatus is set to 1

- Counters in PortSamplesResult are set to zero

- The timer begins decrementing once per tick.

When the timer reaches zero, timing stops and the following events occur:

- The PortXmitQueue quantities, if selected, are latched

- PortSamplesResult counters are started

- SampleStatus is set to 2

- The SampleInterval timer is started

The SampleStart timer allows a performance application to randomize the
sample start time and ensure decoupling from node or network events. Values
used will typically be tens of milliseconds. It is the fine granularity of this inter-
val with respect to the link rate that makes decoupling possible.



SampleInterval


RW


32


Determines the length of the sampling interval. When Set, this value is loaded
into a timer. When the SampleStart counter reaches zero, this timer begins
decrementing once per tick. When it reaches zero, timing stops and the follow-
ing events occur:

- PortSamplesResult counters are stopped and the resulting values made available

- SampleStatus is set to zero


Tag 


RW 


16 


Used by a performance application when it does a PerformanceSet (PortSam- 
plesControl) to uniquely identify its sample run in case of a collision with 
another performance application 

When an application wishes to start a sample run, it should pick a random Tag 
value and do a PerformanceSet (PortSamplesControl). If the returned value of 
Tag does not match the selected value, another application is using the sam- 
pling mechanism. In this case the first application must wait for a suitable time 
and retry its sample 


CounterSelect0   RW   16   Selects the quantity to be sampled by PortSamplesResult:Counter0, as defined in Table 187 SampleSelect Values. If an unimplemented quantity is selected, a Get to PortSamplesResult:Counter0 returns zeroes.

CounterSelect1   RW   16   Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter1.

CounterSelect2   RW   16   Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter2.

CounterSelect3   RW   16   Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter3.

CounterSelect4   RW   16   Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter4.

CounterSelect5   RW   16   Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter5.
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Table 186 PortSamplesControl 



Component         Access   Length (bits)   Description

CounterSelect6    RW       16              Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter6.

CounterSelect7    RW       16              Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter7.

CounterSelect8    RW       16              Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter8.

CounterSelect9    RW       16              Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter9.

CounterSelect10   RW       16              Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter10.

CounterSelect11   RW       16              Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter11.

CounterSelect12   RW       16              Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter12.

CounterSelect13   RW       16              Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter13.

CounterSelect14   RW       16              Similar to CounterSelect0; selects the quantity to be sampled by PortSamplesResult:Counter14.
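
The interplay of SampleStart, SampleInterval and SampleStatus in Table 186 can be pictured with a small tick-driven model. This is a hypothetical sketch, not part of the specification: the class and method names are invented, a single counter stands in for Counter0 to Counter14, and the handling of a zero timer value is a modelling choice.

```python
from dataclasses import dataclass


@dataclass
class SamplerModel:
    """Toy model of one sampling mechanism with a single counter."""
    sample_start: int = 0
    sample_interval: int = 0
    status: int = 0   # 0 = complete, 1 = SampleStart timer running, 2 = sampling
    counter: int = 0

    def perf_set(self, sample_start: int, sample_interval: int) -> bool:
        """Model a Set; it has no effect while SampleStatus is non-zero."""
        if self.status != 0:
            return False
        self.sample_start = sample_start
        self.sample_interval = sample_interval
        self.counter = 0   # counters are set to zero
        self.status = 1    # SampleStart timer begins running
        return True

    def tick(self, events: int = 0) -> None:
        """Advance one tick; `events` is how often the sampled quantity occurred."""
        if self.status == 1:
            self.sample_start -= 1
            if self.sample_start <= 0:
                self.status = 2        # counters start, SampleInterval timer starts
        elif self.status == 2:
            self.counter += events
            self.sample_interval -= 1
            if self.sample_interval <= 0:
                self.status = 0        # counters stop, results become available
```

With perf_set(2, 3), events occurring during the first two ticks are not counted, and the counter accumulates only during the following three ticks.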



16.1.3.3 SampleSelect Values 



Table 187 on page 727 lists the values that can be used in the CounterSe- 
lect[n] components of the PortSamplesControl attribute to select a partic- 
ular quantity to sample. 

Quantities that can be sampled are divided into 3 ranges: 

• Mandatory quantities (0x0000 - 0x3FFF). 

C16-5: Mandatory quantities for performance sampling shall be imple- 
mented on all ports of all nodes. 

• Optional quantities (0x4000 - 0xBFFF).

o16-1: If provided, optional quantities for performance sampling shall be
implemented as described.

• Vendor quantities (0xC000 - 0xFFFF). Vendors may define and im-
plement their own quantities in this range.
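
The three ranges above can be checked with a simple classifier. The function name is hypothetical; the boundaries are the ones stated in the list.

```python
def sample_select_range(value: int) -> str:
    """Classify a 16-bit SampleSelect value into one of the three ranges."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("SampleSelect values are 16 bits wide")
    if value <= 0x3FFF:
        return "mandatory"
    if value <= 0xBFFF:
        return "optional"
    return "vendor"
```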
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Table 187 SampleSelect Values 


Sample Select Value 


Name 


Description 


Mandatory Quantities 


0x0000 


Reserved 


Reserved 


0x0001 


PortXmitData 


Total number of data octets, divided by 4, transmitted
on all VLs during the sampling interval from the port
selected by PortSelect. This includes all octets
between (and not including) the start of packet delim-
iter and VCRC. It excludes all link packets.
Implementers may choose to count data octets in
groups larger than four but are encouraged to choose
the smallest group possible. Results are still reported
in units of four octets.


0x0002 


PortRcvData 


Total number of data octets, divided by 4, received on
all VLs during the sampling interval on the port
selected by PortSelect. This includes all octets
between (and not including) the start of packet delim-
iter and VCRC. It excludes all link packets.
Implementers may choose to count data octets in
groups larger than four but are encouraged to choose
the smallest group possible. Results are still reported
in units of four octets.


0x0003 


PortXmitPkts 


Total number of packets, excluding link packets, 
transmitted on all VLs during the sampling interval 
from the port selected by PortSelect. 


0x0004 


PortRcvPkts 


Total number of packets, including packets containing 
errors and excluding link packets, received on all VLs 
during the sampling interval on the port selected by
PortSelect. 


0x0005 


PortXmitWait


The number of ticks during which the port selected by 
PortSelect had data to transmit but no data was sent 
during the entire tick either because of insufficient 
credits or because of lack of arbitration. 


0x0007-0x3FFF 


Reserved 


Reserved. Result of sampling is all zeroes. 



Optional InfiniBand Quantities 
All quantities that are available in optional attributes as running counters are also optionally available for
sampling over a given period. Each sampling counter corresponding to an optional running counter is reset
to zero for each sample and increments along with the selected running counter during the sampling interval.

For certain quantities, such as PortXmitQueue[n], there are no corresponding optional attributes.
Bits 5 through 0 of a Sample Select Value correspond to the bit number in PortSamplesControl:OptionMask.
All values between 0x4000 and 0xBFFF not listed here are reserved and the result of sampling is all zeroes.
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Table 187 SampleSelect Values 1 



Sample Select Value 


Name 


Description 




0x4n00 


PortXmitQueue[n] 


Contains the transmit queue depth in bytes on VL "n" 
of the port selected by PortSelect at the time the 
SampleStart timer expired 

The goal of measuring queue depths is to enable soft- 
ware to compute the average time data waits for 
transmission inside a node. Ideally, a node should 
increment a counter upon arrival of each byte that is 
destined for a given output port and should decre- 
ment the counter upon departure of each byte from 
the output port. In practice, this will be impossible to 
implement precisely. Implementers are encouraged to 
measure queue depths as accurately as practical and 
to document any systematic measurement errors. 
Note that an implementation can compensate for an 
inherent delay in accounting for arriving bytes by 
introducing an equal delay in accounting for departing 
bytes 




0x4n01 


PortXmitDataVL[n] 


Total number of data octets, divided by 4, transmitted 
on VL "n" from the port selected by PortSelect. This 
includes all octets between the start of packet and 
end of packet delimiters. It excludes all control groups 
and VCRCs. 

Implementers may choose to count data octets in 
groups larger than four but are encouraged to choose 
the smallest group possible. Results are still reported 
as a multiple of four octets 




0x4n02 


PortRcvDataVL[n] 


Total number of data octets, divided by 4, received on
input VL "n" on the port selected by PortSelect. This
includes all octets between the start of packet and
end of packet delimiters. It excludes all control groups
and VCRCs.

Implementers may choose to count data octets in
groups larger than four but are encouraged to choose
the smallest group possible. Results are still reported
as a multiple of four octets


0x4n03


PortXmitPktVL[n] 


Total number of packets transmitted on VL "n" from 
the port selected by PortSelect with or without errors. 




0x4n04


PortRcvPktVL[n]


Total number of packets received on input VL "n" from 
the port selected by PortSelect with or without errors. 




0x4005


PortRcvErrorDetails:
PortLocalPhysicalErrors


See Table 192 PortRcvErrorDetails on page 737.


0x4006


PortRcvErrorDetails:
PortMalformedPacketErrors


See Table 192 PortRcvErrorDetails on page 737.
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Table 187 SampleSelect Values 



Sample Select Value   Name                                                               Description

0x4007                PortRcvErrorDetails: PortBufferOverrunErrors                       See Table 192 PortRcvErrorDetails on page 737.

0x4008                PortRcvErrorDetails: PortDLIDMappingErrors                         See Table 192 PortRcvErrorDetails on page 737.

0x4009                PortRcvErrorDetails: PortVLMappingErrors                           See Table 192 PortRcvErrorDetails on page 737.

0x400A                PortRcvErrorDetails: PortLoopingErrors                             See Table 192 PortRcvErrorDetails on page 737.

0x400B                PortXmitDiscardDetails: PortInactiveDiscards                       See Table 193 PortXmitDiscardDetails on page 738.

0x400C                PortXmitDiscardDetails: PortNeighborMTUDiscards                    See Table 193 PortXmitDiscardDetails on page 738.

0x400D                PortXmitDiscardDetails: PortSwLifetimeLimitDiscards                See Table 193 PortXmitDiscardDetails on page 738.

0x400E                PortXmitDiscardDetails: PortSwHOQLimitDiscards                     See Table 193 PortXmitDiscardDetails on page 738.

0x400F                PortOpRcvCounters: PortOpRcvPkts                                   See Table 194 PortOpRcvCounters on page 738. The op code to be sampled is selected by PortSamplesControl:OpCode.

0x4010                PortOpRcvCounters: PortOpRcvData                                   See Table 194 PortOpRcvCounters on page 738. The op code to be sampled is selected by PortSamplesControl:OpCode.

0x4011                PortFlowCtlCounters: PortXmitFlowPkts                              See Table 195 PortFlowCtlCounters on page 739.

0x4012                PortFlowCtlCounters: PortRcvFlowPkts                               See Table 195 PortFlowCtlCounters on page 739.

0x4n13                PortVLOpPackets: PortVLOpPackets[n]                                See Table 196 PortVLOpPackets on page 740. The op code to be sampled is selected by PortSamplesControl:OpCode.

0x4n14                PortVLOpData: PortVLOpData[n]                                      See Table 197 PortVLOpData on page 742. The op code to be sampled is selected by PortSamplesControl:OpCode.

0x4n15                PortVLXmitFlowCtlUpdateErrors: PortVLXmitFlowCtlUpdateErrors[n]    See Table 198 PortVLXmitFlowCtlUpdateErrors on page 743.
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Table 187 SampleSelect Values 



Sample Select Value   Name                                       Description

0x4n16                PortVLXmitWaitCounters: PortVLXmitWait[n]  See Table 199 PortVLXmitWaitCounters on page 745.

0x4n30                SwPortVLUnkDests: PortVLUnkDests[n]        See Table 200 SwPortVLCongestion on page 747.

Vendor-Defined Quantities

0xC000-0xFFFF         Reserved                                   Reserved for vendor-specific counters
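
For the per-VL entries in Table 187 written as 0x4nXX, a value for a particular VL could be composed as sketched below. The function name is hypothetical, and it assumes that "n" is the VL number occupying the hex digit in bits 11 through 8.

```python
def per_vl_sample_select(base: int, vl: int) -> int:
    """Compose a per-VL SampleSelect value from a 0x4nXX template.

    `base` is the table value with n taken as 0 (for example 0x4001 for
    PortXmitDataVL[n]). Assumption: n is the VL number in bits 11..8.
    """
    if not 0 <= vl <= 0xF:
        raise ValueError("VL number must fit in one hex digit")
    if base & 0x0F00:
        raise ValueError("base must have the n digit set to zero")
    return base | (vl << 8)
```

For example, per_vl_sample_select(0x4001, 3) yields 0x4301, selecting PortXmitDataVL[3] under the stated assumption.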



16.1.3.4 PortSamplesResult



This mandatory attribute reports the results of a particular sample con- 
trolled and initiated via the PortSamplesControl attribute. 

Table 188 PortSamplesResult 



Component 


Access 


Length (bits) 


Description 


Tag 


RO 


16 


Read-only copy of PortSamplesControl:Tag.
The Tag mechanism provides a means for perfor-
mance applications to detect collisions when using
the sampling mechanism. After successfully initiating
a sample run, an application should wait until the
sample should have completed, then repeat a Perfor-
manceGet (PortSamplesResult) until SampleStatus is
zero. If after any Get the Tag value in the result does
not match the value set by the application at the start
of the run, another application has already started a
new sample. In this case the first application must
wait for a suitable time and retry its sample.


Reserved 


RO 


14 


Reserved, shall be zero. 


SampleStatus 


RO 


2 


Read-only copy of PortSamplesControl:SampleSta-
tus. Provided here to minimize traffic while the application
is polling for sample completion.


Counter0


RO


32


Mandatory counter. When PortSamplesControl:Sam-
pleStatus is zero, contains the result of sampling the
quantity selected by PortSamplesCon-
trol:CounterSelect0. Undefined when PortSamples-
Control:SampleStatus is non-zero. The actual number
of valid (least significant) bits in the counter is defined
by PortSamplesControl:CounterWidth.


Counter1


RO


32


Optional counter. All zeroes if not implemented; other-
wise similar to Counter0. Contains the result of sam-
pling the quantity selected by
PortSamplesControl:Counter1Select
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Table 188 PortSamplesResult 



Component   Access   Length (bits)   Description

Counter2    RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter2Select

Counter3    RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter3Select

Counter4    RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter4Select

Counter5    RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter5Select

Counter6    RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter6Select

Counter7    RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter7Select

Counter8    RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter8Select

Counter9    RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter9Select

Counter10   RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter10Select

Counter11   RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter11Select

Counter12   RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter12Select

Counter13   RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter13Select

Counter14   RO       32              Similar to Counter1; contains the result of sampling the quantity selected by PortSamplesControl:Counter14Select
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16.1,3.5 PortCounters 



C16-6: The PortCounters attribute of the Performance class is mandatory. 

It provides basic performance and exception statistics for a port. 

C16-7: All counters shall be initialized to zero; instead of overflowing they 
shall stop at all ones and shall be reset by a management application. 

Note that although PortCounters is mandatory, it contains components 
that are optional. 

Table 189 PortCounters 



Component 


Access 


Length (bits) 


Description 


Reserved


RO 


8 


Reserved, shall be zero. 


PortSelect 


RW 


8 


Selects the port for which the data is reported. Select-
ing a non-existent port results in all zeroes.
If gathering data from all ports at once is supported
(see Table 185 Performance Management ClassPort-
Info:CapabilityMask on page 720), setting PortSelect
to 0xFF will cause data from all ports to be accumu-
lated.


CounterSelect 


RW 


16 


When writing (Set), selects which counters are over-
written by the values specified in their respective
fields. When reading (Get), this is ignored.

Bit 0 - SymbolErrorCounter

Bit 1 - LinkErrorRecoveryCounter

Bit 2 - LinkDownedCounter

Bit 3 - PortRcvErrors

Bit 4 - PortRcvRemotePhysicalErrors

Bit 5 - PortRcvSwitchRelayErrors

Bit 6 - PortXmitDiscards

Bit 7 - PortXmitConstraintErrors

Bit 8 - PortRcvConstraintErrors

Bit 9 - LocalLinkIntegrityErrors

Bit 10 - ExcessiveBufferOverrunErrors

Bit 11 - VL15Dropped

Bit 12 - PortXmitData

Bit 13 - PortRcvData

Bit 14 - PortXmitPkts

Bit 15 - PortRcvPkts



SymbolErrorCounter 


RW 


16 


Total number of symbol errors detected on one or 
more lanes. Refer to Volume 2. 


LinkErrorRecoveryCounter


RW


8


Total number of times the Port Training state machine
has successfully completed the link error recovery
process. Refer to Volume 2.
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Table 189 PortCounters 



Component 


Access 


Length (bits) 


Description 


LinkDownedCounter 


RW 


8 


Total number of times the Port Training state machine 
has failed the link error recovery process and downed 
the link. Refer to Volume 2. 


PortRcvErrors 


RW 


16 


Total number of packets containing an error that were 
received on the port. These errors include: 

- Local physical errors (ICRC, VCRC, FCCRC, and all
physical errors that cause entry into the BAD 
PACKET or BAD PACKET DISCARD states of the 
packet receiver state machine) 

- Malformed data packet errors (LVer, length, VL) 

- Malformed link packet errors (operand, length, VL) 

- Packets discarded due to buffer overrun 


PortRcvRemotePhysi- 
calErrors 


RW 


16 


Total number of packets marked with the EBP delim- 
iter received on the port. 


PortRcvSwitchRelayEr- 
rors 


RW 


16 


Total number of packets received on the port that 
were discarded because they could not be forwarded 
by the switch relay. Reasons for this include: 

- DLID mapping 

- VL mapping 

- looping (output port = input port) 


PortXmitDiscards 


RW 


16 


Total number of outbound packets discarded by the 
port because the port is down or congested. Reasons 
for this include: 

- output port is in the Inactive state 

- packet length exceeded neighbor MTU 

- switch lifetime limit exceeded 

- switch HOQ limit exceeded 


PortXmitConstraintEr- 
rors 


RW 


8 


Total number of packets not transmitted from the port 
for the following reasons: 

- FilterRawOutbound is true and packet is raw

- PartitionEnforcementOutbound is true and packet 
fails partition key check, IP version check, or transport 
header version check 


PortRcvConstraintEr- 
rors 


RW 


8 


Total number of packets received on the port that are 
discarded for the following reasons: 

- FilterRawInbound is true and packet is raw

- PartitionEnforcementInbound is true and packet fails
partition key check, IP version check, or transport
header version check


Reserved1


RO 


8 


Reserved 
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Table 189 PortCounters 



Component 


Access 


Length (bits) 


Description 


LocalLinkIntegrityErrors


RW 


4 


The number of times that the frequency of packets
containing local physical errors exceeded
local_phy_errors; see Table 127 PortInfo on page
634.


ExcessiveBufferOver- 
runErrors 


RW 


4 


The number of times that overrun_errors consecutive
flow control update periods occurred with at least one
overrun error in each period; see Table 127 PortInfo
on page 634.


Reserved2 


RO 


16 


Reserved 


VL15Dropped 


RW 


16 


Number of incoming VL15 packets dropped due to 
resource limitations on port selected by PortSelect 
(due to lack of buffers) 


PortXmitData


RW




Optional; shall be zero if not implemented. Total num-
ber of data octets, divided by 4, transmitted on all VLs
from the port selected by PortSelect. This includes all
octets between (and not including) the start of packet
delimiter and VCRC. It excludes all link packets.
Implementers may choose to count data octets in
groups larger than four but are encouraged to choose
the smallest group possible. Results are still reported
as a multiple of four octets.


PortRcvData


RW




Optional; shall be zero if not implemented. Total num-
ber of data octets, divided by 4, received on all VLs on
the port selected by PortSelect. This includes all
octets between (and not including) the start of packet
delimiter and VCRC. It excludes all link packets.
Implementers may choose to count data octets in
groups larger than four but are encouraged to choose
the smallest group possible. Results are still reported
as a multiple of four octets.


PortXmitPkts


RW 


32 


Optional; shall be zero if not implemented. Total num- 
ber of packets, excluding link packets, transmitted on 
all VLs from the port. 


PortRcvPkts 


RW 


32 


Optional; shall be zero if not implemented. Total num- 
ber of packets, including packets containing errors 
and excluding link packets, received from all VLs on 
the port. 
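
As an illustration of Table 189, the CounterSelect bitmask used when writing PortCounters, and the conversion of the data counters from units of four octets, might be handled as follows. The helper names are hypothetical; the bit positions are taken from the CounterSelect description.

```python
# Bit positions from the PortCounters CounterSelect component.
PORT_COUNTERS_SELECT_BITS = {
    "SymbolErrorCounter": 0, "LinkErrorRecoveryCounter": 1,
    "LinkDownedCounter": 2, "PortRcvErrors": 3,
    "PortRcvRemotePhysicalErrors": 4, "PortRcvSwitchRelayErrors": 5,
    "PortXmitDiscards": 6, "PortXmitConstraintErrors": 7,
    "PortRcvConstraintErrors": 8, "LocalLinkIntegrityErrors": 9,
    "ExcessiveBufferOverrunErrors": 10, "VL15Dropped": 11,
    "PortXmitData": 12, "PortRcvData": 13,
    "PortXmitPkts": 14, "PortRcvPkts": 15,
}


def counter_select_mask(names) -> int:
    """Build the 16-bit CounterSelect value that selects the named counters."""
    mask = 0
    for name in names:
        mask |= 1 << PORT_COUNTERS_SELECT_BITS[name]
    return mask


def octets_from_data_counter(count: int) -> int:
    """PortXmitData and PortRcvData count octets divided by 4; undo the scaling."""
    return count * 4
```

A Set that writes zero into the selected fields with this mask would reset just those counters and leave the others untouched.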



16.1.3.6 Mandatory Performance Attribute Use Model 



The PortCounters information is used in the standard manner, using Per-
formanceGet() to read it and PerformanceSet() to reset the counters.

PortSamplesControl and PortSamplesResult are used together to sample 
one or more quantities over a specified period of time: 






InfiniBand™ Trade Association



Page 734 

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067 



InfiniBand™ Architecture Release 1.0 
Volume 1 - General Specifications 



General Services 



October 24, 2000
FINAL 






The application can first determine the node's sampling capabilities via a
PerformanceGet(PortSamplesControl). This will return the number and
width of available counters, the quantities that can be sampled, and the
basic time interval (tick). From these the application can compute the
maximum sample interval that will not cause counter overflow.
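
The overflow calculation mentioned above could look like the following sketch. The function and the parameter max_increments_per_tick are hypothetical: the application would have to derive that worst-case figure itself from the link rate, the tick length and the quantity being sampled.

```python
def max_sample_interval_ticks(counter_width_bits: int,
                              max_increments_per_tick: int) -> int:
    """Largest SampleInterval (in ticks) that cannot overflow the counter.

    Assumption: the sampled quantity increases by at most
    `max_increments_per_tick` during any single tick.
    """
    if max_increments_per_tick <= 0:
        raise ValueError("worst-case increment must be positive")
    max_count = (1 << counter_width_bits) - 1
    return max_count // max_increments_per_tick
```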


C16-8: To initiate a sample, the application shall do the following:

• Select a random value for SampleStart. The SampleStart timer allows
a performance application to randomize the sample start time and en-
sure decoupling from node or network events. Values used will typi-
cally be tens of milliseconds.

• Select a random Tag value. This value is used to detect collisions
among multiple independent performance applications accessing the
same node.

• Select a SampleInterval value, the quantities to be sampled, and the
counter that will be assigned to count each quantity.

• Do a PerformanceSet (PortSamplesControl). If the returned value of
Tag does not match the selected value, another application is using
the sampling mechanism. In this case the first application must wait
for a suitable time and retry the PerformanceSet().

• Once the sample has been successfully started, the application
should wait until the SampleStart and SampleInterval timers should
have expired, then repeat PerformanceGet (PortSamplesResult) until
SampleStatus is zero. If at any time the returned Tag value no longer
matches the application's chosen value, regardless of SampleStatus,
it means another application has gained control of the sampling
mechanism. In this case the first application must restart the sam-
pling process.
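
The steps above can be sketched against an in-memory stand-in for a node. Everything here is hypothetical: FakeSamplingNode, its methods, the value ranges chosen for the random picks and the made-up result value only illustrate the Tag collision check and the polling loop, not a real management interface.

```python
import random


class FakeSamplingNode:
    """Hypothetical in-memory stand-in for one sampling mechanism."""

    def __init__(self):
        self.tag = 0
        self.status = 0
        self.polls_left = 0
        self.result = 0

    def performance_set(self, tag, sample_start, sample_interval):
        # A Set is ignored while SampleStatus is non-zero; existing values are returned.
        if self.status == 0:
            self.tag = tag
            self.status = 1
            self.polls_left = 2  # pretend the run completes after two polls
        return self.tag

    def performance_get_result(self):
        if self.status != 0:
            self.polls_left -= 1
            if self.polls_left <= 0:
                self.status = 0
                self.result = 1234  # made-up sample value
        return self.tag, self.status, self.result


def run_sample(node, sample_interval, rng, max_polls=10):
    """Return the sampled value, or None if another application holds the mechanism."""
    tag = rng.randrange(1, 1 << 16)       # random 16-bit Tag
    sample_start = rng.randrange(1, 1000)  # random start delay in ticks
    if node.performance_set(tag, sample_start, sample_interval) != tag:
        return None  # collision: the caller should wait and retry
    for _ in range(max_polls):
        got_tag, status, result = node.performance_get_result()
        if got_tag != tag:
            return None  # another application took over; restart the process
        if status == 0:
            return result
    return None
```

A real application would also wait for the SampleStart and SampleInterval timers to expire before its first Get, which the stand-in does not model.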

If more than one set of sampling mechanisms is implemented, the addi-
tional ones are addressed using non-zero Attribute Modifier values. The
previous use model applies to each pair of PortSamplesControl and Port-
SamplesResult, treating each as an independent entity.

16.1.4 Optional Attributes

Performance Management defines the optional Attributes and Attribute
Modifier use summarized in Table 190 Optional Performance Manage-
ment Attributes. They are described in detail in the sections following the
table. All quantities in these attributes can also be sampled via the PortSam-
plesControl mechanism. The optional Attributes available are reflected
by the OptionMask in the PortSamplesControl attribute.
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o16-2: All counters within these optional Performance Attributes shall be 
initialized to zero; instead of overflowing they shall stop at all ones and 
shall be reset by a management application. 

Table 190 Optional Performance Management Attributes 



Attribute Name                  Attribute ID   Attribute Modifier   Description                                      Length

PortRcvErrorDetails             0x15           0x00000000           Port Detailed Error Counters                     16 bytes

PortXmitDiscardDetails          0x16           0x00000000           Port Transmit Discard Counters                   12 bytes

PortOpRcvCounters               0x17           0x00000000           Port Receive Counters per Op Code                12 bytes

PortFlowCtlCounters             0x18           0x00000000           Port Flow Control Counters                       12 bytes

PortVLOpPackets                 0x19           0x00000000           Port Packets Received per Op Code per VL         36 bytes

PortVLOpData                    0x1A           0x00000000           Port Kilobytes Received per Op Code per VL       68 bytes

PortVLXmitFlowCtlUpdateErrors   0x1B           0x00000000           Port Flow Control update errors per VL           8 bytes

PortVLXmitWaitCounters          0x1C           0x00000000           Port Ticks Waiting to Transmit Counters per VL   36 bytes

SwPortVLCongestion              0x30           0x00000000           Switch Port Congestion per VL                    36 bytes



Table 191 Optional Performance Management Attribute / 

Method Map 



Attribute Name                  PerformanceGet   PerformanceSet

PortRcvErrorDetails             X                X

PortXmitDiscardDetails          X                X

PortOpRcvCounters               X                X

PortFlowCtlCounters             X                X

PortVLOpPackets                 X                X

PortVLOpData                    X                X

PortVLXmitFlowCtlUpdateErrors   X                X

PortVLXmitWaitCounters          X                X

SwPortVLCongestion              X                X
16.1.4.1 PortRcvErrorDetails



Table 192 PortRcvErrorDetails

Component                         Access  Length (bits)  Description
Reserved                          RO      8              Reserved, shall be zero.
PortSelect                        RW      8              Selects the port for which the statistics are reported.
                                                         Statistics are accumulated for all VLs on a port.
                                                         Selecting a non-existent port results in all zeroes.
                                                         If gathering data from all ports at once is supported
                                                         (see Table 185 Performance Management
                                                         ClassPortInfo:CapabilityMask on page 720), setting
                                                         PortSelect to 0xFF will cause data from all ports to be
                                                         accumulated.
CounterSelect                     RW      16             When writing (Set), selects which counters are
                                                         overwritten by the values specified in their respective
                                                         fields. When reading (Get), this is ignored.
                                                         Bit 0 - PortLocalPhysicalErrors
                                                         Bit 1 - PortMalformedPacketErrors
                                                         Bit 2 - PortBufferOverrunErrors
                                                         Bit 3 - PortDLIDMappingErrors
                                                         Bit 4 - PortVLMappingErrors
                                                         Bit 5 - PortLoopingErrors
                                                         Bits 6 to 15 - Reserved
PortLocalPhysicalErrors           RW      16             Total number of packets received on the port that
                                                         contain local physical errors (ICRC, VCRC, FCCRC, and
                                                         all physical errors that cause entry into the BAD
                                                         PACKET or BAD PACKET DISCARD states of the
                                                         packet receiver state machine).
PortMalformedPacketErrors         RW      16             Total number of packets received on the port that
                                                         contain malformed packet errors:
                                                         - data packets: LVer, length, VL
                                                         - link packets: operand, length, VL
PortBufferOverrunErrors           RW      16             Total number of packets received on the port that
                                                         were discarded due to buffer overrun.
PortDLIDMappingErrors             RW      16             Total number of packets received on the port that
                                                         were discarded because they could not be forwarded
                                                         by the switch relay due to DLID mapping errors.
PortVLMappingErrors               RW      16             Total number of packets received on the port that
                                                         were discarded because they could not be forwarded
                                                         by the switch relay due to VL mapping errors.
PortLoopingErrors                 RW      16             Total number of packets received on the port that
                                                         were discarded because they could not be forwarded
                                                         by the switch relay due to looping errors (output port =
                                                         input port).
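
The components of Table 192 add up to 128 bits, which matches the 16-byte length given for PortRcvErrorDetails in Table 190. The following sketch packs and unpacks such an attribute. It assumes that the components appear in the order listed in the table and are encoded big-endian; the function and constant names are hypothetical and are not defined by the specification.

```python
import struct

# Reserved (8), PortSelect (8), CounterSelect (16), then six 16-bit counters.
# Field order and big-endian encoding are assumptions of this sketch.
_PORT_RCV_ERROR_DETAILS = struct.Struct(">BBH6H")

COUNTER_NAMES = (
    "PortLocalPhysicalErrors",
    "PortMalformedPacketErrors",
    "PortBufferOverrunErrors",
    "PortDLIDMappingErrors",
    "PortVLMappingErrors",
    "PortLoopingErrors",
)


def pack_port_rcv_error_details(port_select, counter_select, counters):
    """Return the 16-byte attribute; counters missing from the dict are sent as 0."""
    values = [counters.get(name, 0) for name in COUNTER_NAMES]
    return _PORT_RCV_ERROR_DETAILS.pack(0, port_select, counter_select, *values)


def unpack_port_rcv_error_details(data):
    """Return (port_select, counter_select, {counter name: value})."""
    _reserved, port_select, counter_select, *values = _PORT_RCV_ERROR_DETAILS.unpack(data)
    return port_select, counter_select, dict(zip(COUNTER_NAMES, values))
```

For example, a Set that clears only PortLoopingErrors on port 3 would carry `pack_port_rcv_error_details(3, 1 << 5, {"PortLoopingErrors": 0})`.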
16.1.4.2 PortXmitDiscardDetails



Table 193 PortXmitDiscardDetails

Component                         Access  Length (bits)  Description
Reserved                          RO      8              Reserved, shall be zero.
PortSelect                        RW      8              Selects the port for which the statistics are reported.
                                                         Statistics are accumulated for all VLs on a port.
                                                         Selecting a non-existent port results in all zeroes.
                                                         If gathering data from all ports at once is supported
                                                         (see Table 185 Performance Management
                                                         ClassPortInfo:CapabilityMask on page 720), setting
                                                         PortSelect to 0xFF will cause data from all ports to be
                                                         accumulated.
CounterSelect                     RW      16             When writing (Set), selects which counters are
                                                         overwritten by the values specified in their respective
                                                         fields. When reading (Get), this is ignored.
                                                         Bit 0 - PortInactiveDiscards
                                                         Bit 1 - PortNeighborMTUDiscards
                                                         Bit 2 - PortSwLifetimeLimitDiscards
                                                         Bit 3 - PortSwHOQLimitDiscards
                                                         Bits 4 to 15 - Reserved
PortInactiveDiscards              RW      16             Total number of outbound packets discarded by the
                                                         port because it is in the inactive state.
PortNeighborMTUDiscards           RW      16             Total number of outbound packets discarded by the
                                                         port because packet length exceeded the neighbor
                                                         MTU.
PortSwLifetimeLimitDiscards       RW      16             Total number of outbound packets discarded by the
                                                         port because the switch lifetime limit was exceeded.
                                                         Applies to switches only.
PortSwHOQLimitDiscards            RW      16             Total number of outbound packets discarded by the
                                                         port because the switch HOQ lifetime was exceeded.
                                                         Applies to switches only.
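
The CounterSelect semantics in Table 193 (a Set overwrites only the counters whose bit is set) can be sketched as below. The bit assignments come from the table; the function name `apply_set` and the dictionary representation of the counters are hypothetical.

```python
# Bit positions taken from the CounterSelect description in Table 193.
XMIT_DISCARD_BITS = {
    0: "PortInactiveDiscards",
    1: "PortNeighborMTUDiscards",
    2: "PortSwLifetimeLimitDiscards",
    3: "PortSwHOQLimitDiscards",
}


def apply_set(current, counter_select, new_values):
    """Return the counters after a Set: only selected counters are overwritten."""
    updated = dict(current)
    for bit, name in XMIT_DISCARD_BITS.items():
        if counter_select & (1 << bit):
            updated[name] = new_values[name]
    return updated


before = {name: 5 for name in XMIT_DISCARD_BITS.values()}
zeros = {name: 0 for name in XMIT_DISCARD_BITS.values()}
after = apply_set(before, 0b0101, zeros)  # clears bits 0 and 2 only
```

Unselected counters keep their previous values even though the Set carries a value in every field.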


16.1.4.3 PortOpRcvCounters



Table 194 PortOpRcvCounters

Component                         Access  Length (bits)  Description
OpCode                            RW      8              Selects the op code (as found in BTH) for which the
                                                         statistics are reported. 0xFF means all op codes.
PortSelect                        RW      8              Selects the port for which the statistics are reported.
                                                         Statistics are accumulated for all VLs on a port.
                                                         Selecting a non-existent port results in all zeroes.
                                                         If gathering data from all ports at once is supported
                                                         (see Table 185 Performance Management
                                                         ClassPortInfo:CapabilityMask on page 720), setting
                                                         PortSelect to 0xFF will cause data from all ports to be
                                                         accumulated.
CounterSelect                     RW      16             When writing (Set), selects which counters are
                                                         overwritten by the values specified in their respective
                                                         fields. When reading (Get), this is ignored.
                                                         Bit 0 - PortOpRcvPkts
                                                         Bit 1 - PortOpRcvData
                                                         Bits 2 to 15 - Reserved
PortOpRcvPkts                     RW      32             Total number of packets received without error on the
                                                         port selected by PortSelect containing the opcode
                                                         selected by OpCode.
PortOpRcvData                     RW      32             Total number of data octets, divided by 4, received
                                                         without error on all VLs from the port selected by
                                                         PortSelect containing the opcode selected by OpCode.
                                                         This includes all octets between (and not including)
                                                         the start of packet delimiter and VCRC. It excludes all
                                                         link packets.
                                                         Implementers may choose to count data octets in
                                                         groups larger than four but are encouraged to choose
                                                         the smallest group possible. Results are still reported
                                                         as a multiple of four octets.
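
The reporting rule for PortOpRcvData (octets divided by 4, possibly counted in larger groups) can be illustrated with a small calculation. This is a sketch under stated assumptions: the function name `reported_units` is hypothetical, and discarding a partial group is an assumption, since the table does not say how a partial group is handled.

```python
def reported_units(total_octets, group_octets=4):
    """Value reported in units of four octets when counting in groups.

    group_octets is the implementation's counting granularity; it is assumed
    to be a multiple of four. Partial groups are dropped (an assumption).
    """
    if group_octets < 4 or group_octets % 4 != 0:
        raise ValueError("group size must be a multiple of four octets")
    whole_groups = total_octets // group_octets
    return whole_groups * (group_octets // 4)


print(reported_units(1000))      # 250
print(reported_units(1000, 16))  # 248
```

The second call shows why the smallest group is encouraged: counting in 16-octet groups loses the last 8 octets of a 1000-octet total.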



16.1.4.4 PortFlowCtlCounters 



Table 195 PortFlowCtlCounters

Component                         Access  Length (bits)  Description
Reserved                          RO      8              Reserved, shall be zero.
PortSelect                        RW      8              Selects the port for which the statistics are reported.
                                                         Selecting a non-existent port results in all zeroes.
                                                         If gathering data from all ports at once is supported
                                                         (see Table 185 Performance Management
                                                         ClassPortInfo:CapabilityMask on page 720), setting
                                                         PortSelect to 0xFF will cause data from all ports to be
                                                         accumulated.
CounterSelect                     RW      16             When writing (Set), selects which counters are
                                                         overwritten by the values specified in their respective
                                                         fields. When reading (Get), this is ignored.
                                                         Bit 0 - PortXmitFlowPkts
                                                         Bit 1 - PortRcvFlowPkts
                                                         Bits 2 to 15 - Reserved
PortXmitFlowPkts                  RW      32             Total number of flow control packets transmitted on
                                                         the port selected by PortSelect.
PortRcvFlowPkts                   RW      32             Total number of flow control packets received on the
                                                         port selected by PortSelect.



16.1.4.5 PortVLOpPackets 



Table 196 PortVLOpPackets

Component                         Access  Length (bits)  Description
OpCode                            RW      8              Selects the op code (as found in BTH) for which the
                                                         statistics are reported. 0xFF means all op codes.
PortSelect                        RW      8              Selects the port for which the statistics are reported.
                                                         Selecting a non-existent port results in all zeroes.
                                                         If gathering data from all ports at once is supported
                                                         (see Table 185 Performance Management
                                                         ClassPortInfo:CapabilityMask on page 720), setting
                                                         PortSelect to 0xFF will cause data from all ports to be
                                                         accumulated.
CounterSelect                     RW      16             When writing (Set), selects which counters are
                                                         overwritten by the values specified in their respective
                                                         fields. When reading (Get), this is ignored.
                                                         Bit 0 - PortVLOpPackets0
                                                         Bit 1 - PortVLOpPackets1
                                                         Bit 2 - PortVLOpPackets2
                                                         Bit 3 - PortVLOpPackets3
                                                         Bit 4 - PortVLOpPackets4
                                                         Bit 5 - PortVLOpPackets5
                                                         Bit 6 - PortVLOpPackets6
                                                         Bit 7 - PortVLOpPackets7
                                                         Bit 8 - PortVLOpPackets8
                                                         Bit 9 - PortVLOpPackets9
                                                         Bit 10 - PortVLOpPackets10
                                                         Bit 11 - PortVLOpPackets11
                                                         Bit 12 - PortVLOpPackets12
                                                         Bit 13 - PortVLOpPackets13
                                                         Bit 14 - PortVLOpPackets14
                                                         Bit 15 - PortVLOpPackets15
PortVLOpPackets0                  RW      16             The total number of packets received without error on
                                                         VL 0 of the port selected by PortSelect containing the
                                                         opcode selected by OpCode.
PortVLOpPackets1                  RW      16             Similar count for VL 1
PortVLOpPackets2                  RW      16             Similar count for VL 2
PortVLOpPackets3                  RW      16             Similar count for VL 3
PortVLOpPackets4                  RW      16             Similar count for VL 4
PortVLOpPackets5                  RW      16             Similar count for VL 5
PortVLOpPackets6                  RW      16             Similar count for VL 6
PortVLOpPackets7                  RW      16             Similar count for VL 7
PortVLOpPackets8                  RW      16             Similar count for VL 8
PortVLOpPackets9                  RW      16             Similar count for VL 9
PortVLOpPackets10                 RW      16             Similar count for VL 10
PortVLOpPackets11                 RW      16             Similar count for VL 11
PortVLOpPackets12                 RW      16             Similar count for VL 12
PortVLOpPackets13                 RW      16             Similar count for VL 13
PortVLOpPackets14                 RW      16             Similar count for VL 14
PortVLOpPackets15                 RW      16             Similar count for VL 15
16.1.4.6 PortVLOpData



Table 197 PortVLOpData

Component                         Access  Length (bits)  Description
OpCode                            RW      8              Selects the op code (as found in BTH) for which the
                                                         statistics are reported. 0xFF means all op codes.
PortSelect                        RW      8              Selects the port for which the statistics are reported.
                                                         Selecting a non-existent port results in all zeroes.
                                                         If gathering data from all ports at once is supported
                                                         (see Table 185 Performance Management
                                                         ClassPortInfo:CapabilityMask on page 720), setting
                                                         PortSelect to 0xFF will cause data from all ports to be
                                                         accumulated.
CounterSelect                     RW      16             When writing (Set), selects which counters are
                                                         overwritten by the values specified in their respective
                                                         fields. When reading (Get), this is ignored.
                                                         Bit 0 - PortVLOpData0
                                                         Bit 1 - PortVLOpData1
                                                         Bit 2 - PortVLOpData2
                                                         Bit 3 - PortVLOpData3
                                                         Bit 4 - PortVLOpData4
                                                         Bit 5 - PortVLOpData5
                                                         Bit 6 - PortVLOpData6
                                                         Bit 7 - PortVLOpData7
                                                         Bit 8 - PortVLOpData8
                                                         Bit 9 - PortVLOpData9
                                                         Bit 10 - PortVLOpData10
                                                         Bit 11 - PortVLOpData11
                                                         Bit 12 - PortVLOpData12
                                                         Bit 13 - PortVLOpData13
                                                         Bit 14 - PortVLOpData14
                                                         Bit 15 - PortVLOpData15
PortVLOpData0                     RW      32             Total number of data octets, divided by 4, received
                                                         without error on VL 0 from the port selected by
                                                         PortSelect containing the opcode selected by OpCode.
                                                         This includes all octets between (and not including)
                                                         the start of packet and VCRC. It excludes all link
                                                         packets.
                                                         Implementers may choose to count data octets in
                                                         groups larger than four but are encouraged to choose
                                                         the smallest group possible. Results are still reported
                                                         as a multiple of four octets.
PortVLOpData1                     RW      32             Similar count for VL 1
PortVLOpData2                     RW      32             Similar count for VL 2
PortVLOpData3                     RW      32             Similar count for VL 3
PortVLOpData4                     RW      32             Similar count for VL 4
PortVLOpData5                     RW      32             Similar count for VL 5
PortVLOpData6                     RW      32             Similar count for VL 6
PortVLOpData7                     RW      32             Similar count for VL 7
PortVLOpData8                     RW      32             Similar count for VL 8
PortVLOpData9                     RW      32             Similar count for VL 9
PortVLOpData10                    RW      32             Similar count for VL 10
PortVLOpData11                    RW      32             Similar count for VL 11
PortVLOpData12                    RW      32             Similar count for VL 12
PortVLOpData13                    RW      32             Similar count for VL 13
PortVLOpData14                    RW      32             Similar count for VL 14
PortVLOpData15                    RW      32             Similar count for VL 15
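
As a consistency check, the component lengths of the per-VL attributes can be summed and compared with the byte lengths listed in Table 190. The helper below is a hypothetical sketch; it only does arithmetic on the bit widths given in the tables.

```python
def attribute_length_bytes(component_bits):
    """Sum component widths (in bits) and convert to whole bytes."""
    total_bits = sum(component_bits)
    if total_bits % 8 != 0:
        raise ValueError("components do not fill a whole number of bytes")
    return total_bits // 8


# OpCode (8) + PortSelect (8) + CounterSelect (16) + sixteen per-VL counters.
PORT_VL_OP_PACKETS_BITS = [8, 8, 16] + [16] * 16
PORT_VL_OP_DATA_BITS = [8, 8, 16] + [32] * 16

print(attribute_length_bytes(PORT_VL_OP_PACKETS_BITS))  # 36
print(attribute_length_bytes(PORT_VL_OP_DATA_BITS))     # 68
```

Both results agree with Table 190 (36 bytes for PortVLOpPackets, 68 bytes for PortVLOpData).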



16.1.4.7 PortVLXmitFlowCtlUpdateErrors



Table 198 PortVLXmitFlowCtlUpdateErrors

Component                         Access  Length (bits)  Description
Reserved                          RO      8              Reserved, shall be zero.
PortSelect                        RW      8              Selects the port for which the statistics are reported.
                                                         Selecting a non-existent port results in all zeroes.
                                                         If gathering data from all ports at once is supported
                                                         (see Table 185 Performance Management
                                                         ClassPortInfo:CapabilityMask on page 720), setting
                                                         PortSelect to 0xFF will cause data from all ports to be
                                                         accumulated.
CounterSelect                     RW      16             When writing (Set), selects which counters are
                                                         overwritten by the values specified in their respective
                                                         fields. When reading (Get), this is ignored.
                                                         Bit 0 - PortVLXmitFlowCtlUpdateErrors0
                                                         Bit 1 - PortVLXmitFlowCtlUpdateErrors1
                                                         Bit 2 - PortVLXmitFlowCtlUpdateErrors2
                                                         Bit 3 - PortVLXmitFlowCtlUpdateErrors3
                                                         Bit 4 - PortVLXmitFlowCtlUpdateErrors4
                                                         Bit 5 - PortVLXmitFlowCtlUpdateErrors5
                                                         Bit 6 - PortVLXmitFlowCtlUpdateErrors6
                                                         Bit 7 - PortVLXmitFlowCtlUpdateErrors7
                                                         Bit 8 - PortVLXmitFlowCtlUpdateErrors8
                                                         Bit 9 - PortVLXmitFlowCtlUpdateErrors9
                                                         Bit 10 - PortVLXmitFlowCtlUpdateErrors10
                                                         Bit 11 - PortVLXmitFlowCtlUpdateErrors11
                                                         Bit 12 - PortVLXmitFlowCtlUpdateErrors12
                                                         Bit 13 - PortVLXmitFlowCtlUpdateErrors13
                                                         Bit 14 - PortVLXmitFlowCtlUpdateErrors14
                                                         Bit 15 - PortVLXmitFlowCtlUpdateErrors15
PortVLXmitFlowCtlUpdateErrors0    RW      2              Total number of flow control update errors on VL 0 on
                                                         the port selected by PortSelect.
PortVLXmitFlowCtlUpdateErrors1    RW      2              Similar count for VL 1
PortVLXmitFlowCtlUpdateErrors2    RW      2              Similar count for VL 2
PortVLXmitFlowCtlUpdateErrors3    RW      2              Similar count for VL 3
PortVLXmitFlowCtlUpdateErrors4    RW      2              Similar count for VL 4
PortVLXmitFlowCtlUpdateErrors5    RW      2              Similar count for VL 5
PortVLXmitFlowCtlUpdateErrors6    RW      2              Similar count for VL 6
PortVLXmitFlowCtlUpdateErrors7    RW      2              Similar count for VL 7
PortVLXmitFlowCtlUpdateErrors8    RW      2              Similar count for VL 8
PortVLXmitFlowCtlUpdateErrors9    RW      2              Similar count for VL 9
PortVLXmitFlowCtlUpdateErrors10   RW      2              Similar count for VL 10
PortVLXmitFlowCtlUpdateErrors11   RW      2              Similar count for VL 11
PortVLXmitFlowCtlUpdateErrors12   RW      2              Similar count for VL 12
PortVLXmitFlowCtlUpdateErrors13   RW      2              Similar count for VL 13
PortVLXmitFlowCtlUpdateErrors14   RW      2              Similar count for VL 14
PortVLXmitFlowCtlUpdateErrors15   RW      2              Similar count for VL 15
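
The sixteen 2-bit counters of Table 198 occupy 32 bits in total. One way to handle them is to treat them as a single 32-bit word, as sketched below. The placement of VL 0 in the two most significant bits is an assumption of this sketch (it follows the order in which the table lists the components), and the function names are hypothetical.

```python
def pack_2bit_counters(counters):
    """Pack sixteen counters into one 32-bit word, VL 0 first (assumed order)."""
    if len(counters) != 16:
        raise ValueError("expected one counter per VL (16 values)")
    word = 0
    for value in counters:
        # A 2-bit counter stops at all ones (3) rather than overflowing.
        word = (word << 2) | min(value, 3)
    return word


def unpack_2bit_counters(word):
    """Return the sixteen 2-bit counters, index = VL number."""
    return [(word >> (30 - 2 * vl)) & 0x3 for vl in range(16)]
```

With only two bits per VL, each counter can distinguish just zero, one, two, and "three or more" errors before it must be reset.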



16.1.4.8 PortVLXmitWaitCounters 



Table 199 PortVLXmitWaitCounters

Component                         Access  Length (bits)  Description
Reserved                          RO      8              Reserved, shall be zero.
PortSelect                        RW      8              Selects the port for which the statistics are reported.
                                                         Selecting a non-existent port results in all zeroes.
                                                         If gathering data from all ports at once is supported
                                                         (see Table 185 Performance Management
                                                         ClassPortInfo:CapabilityMask on page 720), setting
                                                         PortSelect to 0xFF will cause data from all ports to be
                                                         accumulated.






InfiniBand^ Trade Association 



Page 745 

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067 



InfiniBand™ Architecture Release 1.0 
Volume 1 - General Specifications 



General Services 



October 24, 2000 
FINAL 



Table 199 PortVLXmitWaitCounters

Component                         Access  Length (bits)  Description
CounterSelect                     RW      16             When writing (Set), selects which counters are
                                                         overwritten by the values specified in their respective
                                                         fields. When reading (Get), this is ignored.
                                                         Bit 0 - PortVLXmitWait0
                                                         Bit 1 - PortVLXmitWait1
                                                         Bit 2 - PortVLXmitWait2
                                                         Bit 3 - PortVLXmitWait3
                                                         Bit 4 - PortVLXmitWait4
                                                         Bit 5 - PortVLXmitWait5
                                                         Bit 6 - PortVLXmitWait6
                                                         Bit 7 - PortVLXmitWait7
                                                         Bit 8 - PortVLXmitWait8
                                                         Bit 9 - PortVLXmitWait9
                                                         Bit 10 - PortVLXmitWait10
                                                         Bit 11 - PortVLXmitWait11
                                                         Bit 12 - PortVLXmitWait12
                                                         Bit 13 - PortVLXmitWait13
                                                         Bit 14 - PortVLXmitWait14
                                                         Bit 15 - PortVLXmitWait15
PortVLXmitWait0                   RW      16             Total number of ticks during which the port selected
                                                         by PortSelect had data to transmit on VL 0 but no data
                                                         was sent during the entire tick either because of
                                                         insufficient credits or because of lack of arbitration.
PortVLXmitWait1                   RW      16             Similar count for VL 1
PortVLXmitWait2                   RW      16             Similar count for VL 2
PortVLXmitWait3                   RW      16             Similar count for VL 3
PortVLXmitWait4                   RW      16             Similar count for VL 4
PortVLXmitWait5                   RW      16             Similar count for VL 5
PortVLXmitWait6                   RW      16             Similar count for VL 6
PortVLXmitWait7                   RW      16             Similar count for VL 7
PortVLXmitWait8                   RW      16             Similar count for VL 8
PortVLXmitWait9                   RW      16             Similar count for VL 9
PortVLXmitWait10                  RW      16             Similar count for VL 10
PortVLXmitWait11                  RW      16             Similar count for VL 11
PortVLXmitWait12                  RW      16             Similar count for VL 12
PortVLXmitWait13                  RW      16             Similar count for VL 13
PortVLXmitWait14                  RW      16             Similar count for VL 14
PortVLXmitWait15                  RW      16             Similar count for VL 15



16.1.4.9 SwPortVLCongestion 



Unlike the rest of the performance attributes described in this chapter, 
which apply to all node types, this optional attribute applies only to 
Switches. 

Table 200 SwPortVLCongestion

Component                         Access  Length (bits)  Description
Reserved                          RO      8              Reserved, shall be zero.
PortSelect                        RW      8              Selects the port for which the statistics are reported.
                                                         Selecting a non-existent port results in all zeroes.
                                                         If gathering data from all ports at once is supported
                                                         (see Table 185 Performance Management
                                                         ClassPortInfo:CapabilityMask on page 720), setting
                                                         PortSelect to 0xFF will cause data from all ports to be
                                                         accumulated.
CounterSelect                     RW      16             When writing (Set), selects which counters are
                                                         overwritten by the values specified in their respective
                                                         fields. When reading (Get), this is ignored.
                                                         Bit 0 - SWPortVLCongestion0
                                                         Bit 1 - SWPortVLCongestion1
                                                         Bit 2 - SWPortVLCongestion2
                                                         Bit 3 - SWPortVLCongestion3
                                                         Bit 4 - SWPortVLCongestion4
                                                         Bit 5 - SWPortVLCongestion5
                                                         Bit 6 - SWPortVLCongestion6
                                                         Bit 7 - SWPortVLCongestion7
                                                         Bit 8 - SWPortVLCongestion8
                                                         Bit 9 - SWPortVLCongestion9
                                                         Bit 10 - SWPortVLCongestion10
                                                         Bit 11 - SWPortVLCongestion11
                                                         Bit 12 - SWPortVLCongestion12
                                                         Bit 13 - SWPortVLCongestion13
                                                         Bit 14 - SWPortVLCongestion14
                                                         Bit 15 - SWPortVLCongestion15
SWPortVLCongestion0               RW      16             Total number of packets to be transmitted on VL 0 of
                                                         the output port selected by PortSelect that were
                                                         discarded because of congestion. This includes the
                                                         following reasons:
                                                         - Switch lifetime limit exceeded
                                                         - Switch HOQ limit exceeded
SWPortVLCongestion1               RW      16             Similar count for VL 1
SWPortVLCongestion2               RW      16             Similar count for VL 2
SWPortVLCongestion3               RW      16             Similar count for VL 3
SWPortVLCongestion4               RW      16             Similar count for VL 4
SWPortVLCongestion5               RW      16             Similar count for VL 5
SWPortVLCongestion6               RW      16             Similar count for VL 6
SWPortVLCongestion7               RW      16             Similar count for VL 7
SWPortVLCongestion8               RW      16             Similar count for VL 8
SWPortVLCongestion9               RW      16             Similar count for VL 9
SWPortVLCongestion10              RW      16             Similar count for VL 10
SWPortVLCongestion11              RW      16             Similar count for VL 11
SWPortVLCongestion12              RW      16             Similar count for VL 12
SWPortVLCongestion13              RW      16             Similar count for VL 13
SWPortVLCongestion14              RW      16             Similar count for VL 14
SWPortVLCongestion15              RW      16             Similar count for VL 15



16.2 Baseboard Management 



C16-9: The Baseboard Management Agent is mandatory on all nodes.

This section describes the Management Datagrams used to transport
Baseboard Management commands across the fabric. For more information
regarding Hardware Management of IB Modules, non-Modules (IB
devices whose packaging is different from an IB Module form factor),
and Chassis, see InfiniBand Architecture Specification, Volume 2,
Chapter "Hardware Management". A simplified overview is presented
here.
[Diagram not reproduced. Recoverable labels: a Baseboard Managed Unit
(which can be an Add-in Board, a Managed Chassis, or another I/O Unit)
containing a TCA (if an Add-in Board or I/O Unit) or a Switch (if a
Managed Chassis); the InfiniBand Subnet; GSI (QP1); QP0; BMA; SMA; MME;
IB-ML; Power Control; Temp Sensors; ModuleInfo; ChassisInfo. The drawing
also carries a note beginning "Note: Drawing not to", truncated in the
source.]



Figure 167 Baseboard Management Architecture 



The SM and SMA are shown in Figure 167 strictly to indicate that Base- 
board Management and Subnet Management are independent, separate 
entities in the fabric providing non-overlapping functionality. 

The Baseboard Manager (BM) is a software entity that manages the hard- 
ware via Baseboard Management messages. From the BM, these mes- 
sages are tunneled (encapsulated in MADs) through the IBA fabric to the 
Baseboard Management Agent (BMA), which then recognizes the mes- 
sage and forwards it to the Module Management Entity (MME). The MME 
processes the embedded Baseboard Management commands. In some 
cases, this results in the MME generating corresponding messages and 
transactions on a present InfiniBand Management Link (IB-ML). IB-ML 
messages may interface with a present Chassis Management Entity 
(CME). 
16.2.1 MAD Format



The BM may use the Subnet Administration Interface to retrieve informa- 
tion regarding the nodes discovered in the IBA subnet, such as ad- 
dresses, capabilities, types. 

The Subnet Administration Interface provides basic discovery information 
that is common to all IBA endnodes, regardless of the type of Endnode in 
the subnet as described above. It does not provide information beyond 
this, such as VPD, chassis management data, and any other information 
under Baseboard Management control. This information is "discovered" 
through baseboard management. 

A Baseboard Managed Unit can be either an IB-Module as defined in 
Volume 2, a form factor other than defined in Volume 2 (a non-Module), or 
a Managed Chassis. Protocol-aware IB-Modules handle the send/receive 
of the Baseboard Management MADs. The MADs are addressed using 
the LID of any port of the IB device on the Module. A Managed Chassis 
may contain a switch which handles the send/receive of the Baseboard 
Management MADs. The MADs are addressed using the LID of the 
switch. 



C16-10: The datagrams in the Baseboard Management class shall conform
to the MAD format and use as specified in 13.4 Management Datagrams
on page 573 and further customized in Figure 168 Baseboard
Management MAD Format on page 751 and Table 201 Baseboard
Management MAD Fields on page 752 below.



Figure 168 Baseboard Management MAD Format

[Diagram not reproduced. Recoverable labels: a "bytes" axis marked 64
and 252, and a "Data" field.]



Table 201 Baseboard Management MAD Fields

Field Name                  Length     Description
Common MAD Header           24 bytes   Common MAD as described in 13.4.2 Management Datagram Format on
                                       page 574.
B_Key                       8 bytes    BM specific key. See 16.2.4 B_Key General Use on page 758 for
                                       definition and use.
Reserved                    32 bytes   This field is reserved and shall be set to zeroes.
Baseboard Management Data   192 bytes  Attribute data is mapped bit for bit from the format described in
                                       the following sections to the start of this data field. If the
                                       attribute is smaller than the data field, the content of the
                                       remainder of the data field is unspecified.
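
The field lengths in Table 201 add up to 256 bytes, with the data field starting at byte offset 64. The sketch below computes the offsets and assembles a datagram from its parts. The function names are hypothetical; encoding the B_Key big-endian and zero-filling the unused tail of the data field are choices made for this sketch, since the table leaves that remainder unspecified.

```python
# (field name, length in bytes), in the order given by Table 201.
BM_MAD_FIELDS = (
    ("Common MAD Header", 24),
    ("B_Key", 8),
    ("Reserved", 32),
    ("Baseboard Management Data", 192),
)


def field_offsets(fields=BM_MAD_FIELDS):
    """Return ({name: (offset, length)}, total length in bytes)."""
    offsets, position = {}, 0
    for name, length in fields:
        offsets[name] = (position, length)
        position += length
    return offsets, position


def build_bm_mad(common_header, b_key, attribute_data):
    """Concatenate the four fields into one datagram (illustrative only)."""
    if len(common_header) != 24:
        raise ValueError("common MAD header must be 24 bytes")
    if len(attribute_data) > 192:
        raise ValueError("attribute data exceeds the 192-byte data field")
    data = attribute_data.ljust(192, b"\x00")  # zero fill is a sketch choice
    return common_header + b_key.to_bytes(8, "big") + bytes(32) + data
```

`field_offsets()` returns a total of 256 and places "Baseboard Management Data" at offset 64.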



16.2.1.1 Status Field 



The Status field is described in 13.4.7 Status Field on page 587. No
class-specific bits are defined.

Table 202 Baseboard Management Status Field

Bits    Name    Meaning
0-7             Common bits as defined in 13.4.7 Status Field on page 587
8-15            Class-specific bits are reserved



16.2.2 Methods 



The Baseboard Management class uses a subset of the common
methods described in 13.4.5 Management Class Methods on page 577.



Table 203 Baseboard Management Methods

Method Type       Value  Description
BMGet()           0x01   Request a get (read) of an attribute.
BMSet()           0x02   Request a set (write) of an attribute.
BMGetResp()       0x81   Response from a get or set request.
BMSend()          0x03   Send Baseboard Management attribute (this can be the attribute
                         for an encapsulated Baseboard Management request [command]
                         or response).
BMTrap()          0x05   Notify an event occurred.
BMTrapRepress()   0x07   Block repetition of notification.



Here are some examples of transactions. Semantics are described in 
Volume 2A, Chapter 9. 



[Diagram not reproduced. Recoverable labels: Datagram Transactions; BM;
BMA; MME; IBA Fabric; BMSend() with lsb of attribute modifier = 0;
Perform Baseboard Management; BMSend() with lsb of attribute
modifier = 1 [response] (a response is not required for all commands).]

Figure 169 BM Initiated IB-ML cmd



Request and response capabilities are symmetric, i.e., BMSend() may be
used to send a request from the MME as well as a response. Similarly, the
Baseboard Manager may use BMSend() to deliver a response. Figure 170
IB-ML initiated cmd on page 754 illustrates the path that would be taken
if a CME generated a request to the Baseboard Manager by delivering the
request via a module's IB-ML-to-IB functionality.
[Diagram not reproduced. Recoverable labels: Datagram Transactions; BM;
BMA; MME; IB-ML; CME; IBA Fabric; BMSend(); BMSend() (a response is not
required for all commands).]

Figure 170 IB-ML initiated cmd



16.2.3 Attributes



Table 204 Baseboard Management Attributes on page 754 summarizes 
the attributes used by the Baseboard Management class. 



Table 204 Baseboard Management Attributes

Attribute Name      Attribute ID  Attribute Modifier (a)     Description
ClassPortInfo       0x0001        0x00000000                 General and port-specific information for the BM class.
Notice              0x0002        0x00000000                 Information regarding a Trap.
BKeyInfo            0x0010        0x00000000                 B_Key information for the node.
WriteVPD            0x0020        0x00000000 / 0x00000001    See Hardware Management chapter in Volume 2 for this
                                                             and following attributes.
ReadVPD             0x0021        0x00000000 / 0x00000001
ResetIBML           0x0022        0x00000000 / 0x00000001
SetModulePMControl  0x0023        0x00000000 / 0x00000001
GetModulePMControl  0x0024        0x00000000 / 0x00000001
SetUnitPMControl    0x0025        0x00000000 / 0x00000001
GetUnitPMControl    0x0026        0x00000000 / 0x00000001
SetIOCPMControl     0x0027        0x00000000 / 0x00000001
GetIOCPMControl     0x0028        0x00000000 / 0x00000001
SetModuleState      0x0029        0x00000000 / 0x00000001
SetModuleAttention  0x002A        0x00000000 / 0x00000001
GetModuleStatus     0x002B        0x00000000 / 0x00000001
IB2IBML             0x002C        0x00000000 / 0x00000001
IB2CME              0x002D        0x00000000 / 0x00000001
IB2MME              0x002E        0x00000000 / 0x00000001
OEM                 0x002F        0x00000000 / 0x00000001

a. Where two attribute modifiers are listed, the least significant bit of the Attribute Modifier is used
to identify BMSend() requests (0) from responses. Refer to InfiniBand Architecture
Specification, Volume 2, Chapter "Hardware Management" for more details.
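
The footnote's rule for telling BMSend() requests from responses amounts to testing one bit. The helpers below are a hypothetical sketch of that test; only the use of the least significant bit comes from the text, and the function names are invented for illustration.

```python
def bmsend_modifier(is_response):
    """Attribute Modifier for a BMSend(): lsb 0 for a request, 1 for a response."""
    return 0x00000001 if is_response else 0x00000000


def is_bmsend_response(attribute_modifier):
    """True when the least significant bit marks the MAD as a response."""
    return (attribute_modifier & 0x1) == 1
```

Masking only the low bit, rather than comparing the whole modifier with 1, keeps the check valid if other modifier bits carry meaning defined elsewhere.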

Table 205 Baseboard Management Attribute / Method Map

Attribute Name      BMGet  BMSet  BMSend  BMTrap
ClassPortInfo       X      X
Notice              X      X              X
BKeyInfo            X      X
WriteVPD                          X
ReadVPD                           X
ResetIBML                         X
SetModulePMControl                X
GetModulePMControl                X
SetUnitPMControl                  X
GetUnitPMControl                  X
SetIOCPMControl                   X
GetIOCPMControl                   X
SetModuleState                    X
SetModuleAttention                X
GetModuleStatus                   X
IB2IBML                           X
IB2CME                            X
IB2MME                            X
OEM                               X



16.2.3.1 ClassPortInfo



The ClassPortInfo attribute is described in 13.4.8.1 ClassPortInfo on page
589. In addition, bit 8 of the CapabilityMask component is defined:

Table 206 Baseboard Management ClassPortInfo:CapabilityMask

Bits   Name             Meaning
0-7                     Common bits as defined in 13.4.8.1 ClassPortInfo on page 589
8      IsIBMLSupported  Direct Access to IB-ML is supported
9-15                    Reserved



16.2.3.2 Notice



The Notice attribute is described in 13.4.8.2 Notice on page 591. It is used
for one optional generic trap.



Table 207 Baseboard Management Traps

Name           Type      Number  Data Details
BKeyViolation  Security  259     Bad B_Key, <B_Key> from <LIDADDR>/<GIDADDR>/<QP> attempted
                                 <METHOD> with <ATTRIBUTEID> and <ATTRIBUTEMODIFIER>.

Other Baseboard traps are defined in InfiniBand Architecture Specification, Volume 2, Chapter
"Hardware Management".



o16-3: Baseboard Management Traps use the following layout for the
DataDetails component of the Notice attribute, see Table 208 Notice
DataDetails For Trap 259 on page 757. Fields shall be filled with the
information corresponding to the description of a given trap.

Table 208 Notice DataDetails For Trap 259

Field              Length (bits)  Description
LIDADDR            16             Local Identifier
METHOD             8              Method
Reserved1          8              Shall be filled with zeroes
ATTRIBUTEID        16             Attribute ID
ATTRIBUTEMODIFIER  32             Attribute Modifier
Reserved2          8              Shall be filled with zeroes
QP                 24             Queue Pair
BKEY               64             B_Key
GIDADDR            128            Global Identifier.
                                  If no GRH is present in the offending packet, this field
                                  shall be filled with zeroes.
Padding            128            Shall be ignored on read. Content is unspecified.
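
The fields of Table 208 total 432 bits, or 54 bytes. The following sketch packs them in table order. It assumes big-endian encoding and that Reserved2 occupies the byte immediately before the 24-bit QP; the function name is hypothetical, and the zero-filled padding is a choice of the sketch (the table leaves its content unspecified).

```python
import struct

# LIDADDR, METHOD, Reserved1, ATTRIBUTEID, ATTRIBUTEMODIFIER,
# Reserved2+QP (one 32-bit word), BKEY, GIDADDR, Padding.
_TRAP_259 = struct.Struct(">HBBHIIQ16s16s")


def pack_trap_259(lid, method, attribute_id, attribute_modifier,
                  qp, b_key, gid=bytes(16)):
    """Return the 54-byte DataDetails for a BKeyViolation trap (illustrative)."""
    return _TRAP_259.pack(
        lid,
        method,
        0,                  # Reserved1: zeroes
        attribute_id,
        attribute_modifier,
        qp & 0xFFFFFF,      # Reserved2 (high byte) stays zero, QP in low 24 bits
        b_key,
        gid,                # all zeroes when the offending packet had no GRH
        bytes(16),          # Padding: ignored on read; zero here by choice
    )
```

A receiver would ignore the last 16 bytes regardless of what the sender placed there.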
16.2.3.3 BKeyInfo

Table 209 BKeyInfo

Component | Access | Length(bits) | Description
B_Key | RW | 64 | The 8-byte Baseboard Management key used in all BM MADs by all valid BMs. A value of 0 means no B_Key check is ever done by the BMA.
B_KeyProtectBit | RW | 1 | See 16.2.4.3 B_Key Operation on page 759 for details.
Reserved | RW | 15 | Shall be set to zeroes.
B_KeyLeasePeriod | RW | 16 | Timer value used to indicate how long the B_Key Protection bit is to remain non zero after a BMSet(BKeyInfo) MAD that failed a B_Key check is dropped. The value of the timer indicates the number of seconds for the lease period. With a 16 bit counter, the period can range from one second to approximately 18 hours. 0 shall mean infinite. See 16.2.4.5 B_Key Recovery on page 760 for details.
B_KeyViolations | RO | 16 | Number of MADs that have been received at this node since power-on or reset that have been dropped due to a failed B_Key check if such a counter is implemented. Otherwise this shall be 0xFFFF.
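
The "approximately 18 hours" figure for B_KeyLeasePeriod follows from the 16-bit field width: the largest value is 65535 seconds, and 65535 / 3600 is about 18.2 hours. A small sketch of that arithmetic is shown here; the helper name is hypothetical.

```python
MAX_LEASE_SECONDS = (1 << 16) - 1  # largest value a 16-bit field can hold


def describe_lease_period(value):
    """Render a B_KeyLeasePeriod value; 0 means the lease never expires."""
    if not 0 <= value <= MAX_LEASE_SECONDS:
        raise ValueError("lease period must fit in 16 bits")
    if value == 0:
        return "infinite"
    return f"{value} s ({value / 3600:.2f} h)"
```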



16.2.3.4 IB-ML Attributes

See InfiniBand Architecture Specification, Volume 2, Chapter "Hardware
Management", section "Management Commands" for a description of the
BM class specific attributes and their format.

16.2.4 B_Key General Use

The BM includes the Baseboard Management Key (B_Key) in the BM
MAD to obtain authorization. The B_Key is used to authenticate a trusted
source. This model assumes that the fabric has some level of physical
security.

16.2.4.1 B_Key Assumptions



1) To use the correct key for each node, the BM or a higher-level B_Key
manager keeps track of the keys for the nodes that it is managing.

2) If a backup BM exists, it shares the B_Keys for ease of fail-over.

3) A BM may have exclusive access to a set of nodes, by using a B_Key
which is only known by that BM and those particular nodes. Since
nodes reply to Baseboard Manager's requests using their own
B_Key, if a Baseboard Manager assigns more than one B_Key to
devices on its management domain, it needs to run on an HCA whose
BMA can respond to all such B_Keys.

4) The BM sets the B_Key, the B_Key Protection bit, and the B_Key
lease period in the BKeyInfo Attribute with one BMSet(BKeyInfo)
MAD. A successful completion of this assignment indicates to the BM
that it has taken ownership of the node.



16.2.4.2 B_Key Protection Scope

Each BMA (Baseboard Management Agent) in a node has one B_Key.
Table 210 on page 759 shows the scope protected by that B_Key. The
semantics are explicitly defined in InfiniBand Architecture Specification,
Volume 2, Chapter "Hardware Management".

Table 210 B_Key Protection Scope

Source | Targeted Entity | Protection
BM | Reads and writes to: ClassPortInfo (e.g. BM LID in TrapLID); BKeyInfo (e.g. B_Key, B_Key Protection Bit) | yes
BM | Attributes causing reads from and writes to IB-ML: ModuleInfo (a); IB-module Specific Data; ChassisInfo (b); CME (b); Other IB-ML devices | yes
IB-ML Managed Unit | Attributes causing reads from and writes to IB-ML: IB-module VPD; IB-module Specific Data; ChassisInfo (b); Other IB-ML devices | no

a. The IB-Module vendor protects the factory-programmed portion of ModuleInfo
against writes even if a proper B_Key is provided.

b. The Chassis vendor protects the factory-programmed portion of ChassisInfo
against writes even if a proper B_Key is provided. If further protection is
desired, the CME or the Chassis provides it.

16.2.4.3 B_Key Operation

C16-11: The BMA shall check the B_Key contained in incoming MADs.

The success and effect of the check depend on the value of the B_Key
and B_Key Protection bit of the BMA and on the method and attribute
contained in the incoming MAD.



Table 211 B_Key Check

BMA's B_Key | BMA's B_Key Protection Bit | MAD's method | Success
zero | any | any | yes
non-zero | any | BMSet(), BMSend() | if MAD's B_Key equals BMA's B_Key
non-zero | 0 | BMGet() | yes
non-zero | 1 | BMGet() | yes (a)

a. Even though the check succeeds, the B_Key value in the BKeyInfo attribute shall be returned as
zero.

C16-12: If B_Key check fails, the BMA shall: 

1) Drop the MAD. 

2) Increment a B_Key Violation counter if supported. 

3) Send a BKeyViolation trap if traps are supported by the BMA. 

4) Start a countdown timer with the B_Key lease period value. 
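
The check in Table 211 and the failure handling in C16-12 can be sketched as follows. This is an illustration only: the `BmaState` class, the method-name strings, and the recorded action strings are hypothetical, and the sketch does not model counter overflow or the timer itself.

```python
from dataclasses import dataclass, field


@dataclass
class BmaState:
    """Hypothetical in-memory model of a BMA's B_Key state."""
    bkey: int = 0
    protect_bit: int = 0
    lease_period: int = 0
    supports_counter: bool = True
    supports_traps: bool = True
    violations: int = 0
    actions: list = field(default_factory=list)


def bkey_check(state, method, mad_bkey):
    """Return True when the check succeeds according to Table 211."""
    if state.bkey == 0:
        return True                      # zero B_Key: no check is done
    if method in ("BMSet", "BMSend"):
        return mad_bkey == state.bkey    # must match regardless of protect bit
    if method == "BMGet":
        return True                      # succeeds for either protect bit value
    raise ValueError(f"method {method!r} is not covered by Table 211")


def receive_mad(state, method, mad_bkey):
    """Apply the check; on failure record the four steps of C16-12."""
    if bkey_check(state, method, mad_bkey):
        return True
    state.actions.append("drop MAD")
    if state.supports_counter:
        state.violations += 1
    if state.supports_traps:
        state.actions.append("send BKeyViolation trap")
    state.actions.append(f"start lease timer ({state.lease_period} s)")
    return False
```

Note that the B_Key Protection bit does not change whether the check succeeds; per footnote (a) of Table 211 it only affects the B_Key value returned in BKeyInfo.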



16.2.4.4 B_Key Initialization

C16-13: At power up or reset, the B_Key, B_Key Protection bit and B_Key
lease period shall be set to zero if NVRAM is not used; otherwise, they
shall be set to the values stored in NVRAM.

Using a BMSet(BKeyInfo), the BM may assign the subsequent B_Key,
B_Key Protection bit and B_Key lease period.

16.2.4.5 B_Key Recovery



The B_Key lease period timer starts when a B_Key check fails. At this
time, the node sends a trap to the BM (if traps are supported and if the BM
stored its information in the trap components of the ClassPortInfo
attribute). This trap serves as a request to the BM to refresh the lease period
by issuing a BMSet(BKeyInfo). A successful BMSet(BKeyInfo) will stop
the timer and will rearm it.

If the BM that originally set the B_Key has gone away, then the lease
period expires, clearing the B_Key Protection bit and allowing anyone to
read (and then set) the B_Key.

In the case where a node starts with NVRAM B_Key and B_Key Protection
bits set and the TrapLID is zero (because no BM has come around to
set it), the node has no BM to send the trap to. In this case, the node does
not send the trap and the lease period timer will expire, causing eventual
take over by a new BM.

With the BMGet(BKeyInfo), any BM can detect whether a B_Key is set
(although hidden) based on the B_Key Protection bit. If the B_Key Protection
bit is set, the B_Key is set and hidden. Otherwise the returned B_Key is
the real one even if it is zero.



16.2.4.6 Levels of Protection 



There are four different protection levels based on the B_Key, depending 
on the system requirements. 

Table 212 Protection Levels

B_Key | B_Key Protection bit | B_Key Lease Period | Description
0 | any | any | No protection provided. Any BM can issue sets and sends.
non-zero | 0 | n/a | Protection provided, but allows BMs to read the B_Key in the node.
non-zero | 1 | non-zero | Protection provided and does not allow anyone to read the B_Key in the node until the lease period has expired. The B_Key lease period is a mechanism to allow the B_Key to be protected only for a given amount of time.
non-zero | 1 | 0 | Protection provided and does not allow the B_Key in the node to be read by other BMs. It must be noted that if the lease period was set to 0 (infinite) and the BM that set it dies, there is no possibility for other BMs to ever read it. So if the B_Key is not provided by some unspecified way to the other BMs, the BMA of this node will never be accessible again.
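
The four rows of Table 212 map directly onto a small decision function. The sketch below is illustrative; the function name and the returned labels are hypothetical shorthand for the table's descriptions.

```python
def protection_level(bkey, protect_bit, lease_period):
    """Classify a BKeyInfo setting into one of the four rows of Table 212."""
    if bkey == 0:
        return "no protection"
    if protect_bit == 0:
        return "protected, B_Key readable"
    if lease_period != 0:
        return "protected, B_Key hidden until lease expires"
    return "protected, B_Key hidden indefinitely"
```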



16.3 Device Management 



The Device Management Agent is optional. 

I/O Devices and I/O controllers (IOC) are not directly connected to the IBA
fabric. An I/O Unit (IOU) containing one or more IOCs is attached to the
fabric via a TCA. The TCA is responsible for receiving packets from the
fabric and delivering complete, valid messages to IOCs, and vice-versa.
The TCA might use memory resources supplied by the IOC to assemble
the packet, and it notifies the IOC when the complete packet is available for
consumption. The IOC is then responsible for executing I/O requests such as
network sends and receives or disk reads and writes over a device-specific
interface such as Ethernet or SCSI.

This chapter does not address direct management of end devices such as
disk drives but focuses on the infrastructure, related methods, data
formats and attributes to support IOU/IOC management over the fabric. This
chapter defines mechanisms to send and receive device management
packets between two fabric attachment points such as an HCA and a TCA.
The mechanisms required to translate MADs into a format that the end
devices understand, and how the data is delivered to and retrieved from an end
device, are device-specific and therefore are not addressed in this
specification.

The IBA is based on message passing. For IOU and IOC, the messages
fall into three classes: fabric configuration, unit management/configuration,
and I/O transaction:

• Fabric configuration messages that are processed by the Subnet
Management Agent (SMA) are defined in Chapter 14.

• Messages specific to configuring and managing a device that are
received through the General Services Interface (GSI) are described in
this section.

• I/O transaction messages are not defined in this document. I/O
transaction messages include those messages used by an initiator to
request I/O services from an IOC, messages containing user or
application data, and messages used by the IOC to provide a
completion notification (ending status) to the requestor. Also included in
this class are in-band configuration messages (parameters, etc.)
directed only to an IOC, and not to the larger IOU as a whole. These
messages travel as I/O requests but perform management functions
specific to the I/O controller.

Although this chapter tends to use language implying that an IOU
"contains" IOCs, there are no restrictions on how IOCs are connected to, or
served by, the TCA. Figure 171 Architectural Model for an I/O Unit on
page 762 provides the architectural and connection models for an IOU,
consisting of a TCA and one or more IOCs.



Figure 171 Architectural Model for an I/O Unit
(diagram labels: I/O Unit; I/O Controller, repeated; I/O Port or Devices)

16.3.1 MAD Format



o16-4: The datagrams in the Device Management class shall conform to 
the MAD format and use as specified in 13.4 Management Datagrams on 
page 573 and further customized in Figure 172 Device Management MAD 
Format on page 763 and Table 213 Device Management MAD Fields on 
page 763 below. 

Figure 172 Device Management MAD Format

bytes 0-23     Common MAD Header
bytes 24-63    Reserved
bytes 64-255   Data


Table 213 Device Management MAD Fields

Field | Length | Description
Common MAD Header | 24 bytes | Common MAD Header as described in 13.4.2 Management Datagram Format on page 574
Reserved | 40 bytes | Shall be set to zeroes.
DevMgt Data | 192 bytes | 192 bytes of Device Management payload. The structure and content depends upon the Method, Attribute and Attribute Modifier fields in the header.
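
The three field lengths in Table 213 sum to a 256-byte MAD (24 + 40 + 192). A minimal sketch of assembling and splitting such a datagram is shown below. The helper names are hypothetical, and padding a short payload with zeroes is an assumption of this sketch rather than a stated requirement.

```python
COMMON_MAD_HEADER_LEN = 24
RESERVED_LEN = 40
DEVMGT_DATA_LEN = 192
MAD_LEN = COMMON_MAD_HEADER_LEN + RESERVED_LEN + DEVMGT_DATA_LEN  # 256


def build_devmgt_mad(common_header, data):
    """Concatenate header, zeroed reserved area, and payload."""
    if len(common_header) != COMMON_MAD_HEADER_LEN:
        raise ValueError("common MAD header must be 24 bytes")
    if len(data) > DEVMGT_DATA_LEN:
        raise ValueError("payload exceeds 192 bytes")
    # Assumption: unused payload bytes are zero-filled.
    return common_header + bytes(RESERVED_LEN) + data.ljust(DEVMGT_DATA_LEN, b"\x00")


def split_devmgt_mad(mad):
    """Return (common_header, reserved, data) slices of a 256-byte MAD."""
    if len(mad) != MAD_LEN:
        raise ValueError("Device Management MAD must be 256 bytes")
    return mad[:24], mad[24:64], mad[64:]
```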
16.3.1.1 Status Field



The Status field is described in 13.4.7 Status Field on page 587 . Some 
class-specific bits are defined. 

Table 214 Device Management Status Field

Bits | Name | Meaning
0-7 | - | Common bits as defined in 13.4.7 Status Field on page 587
8 | NoResponse | IOC Not responding
9 | NoServiceEntries | Service Entries are not supported
10-14 | - | Reserved
15 | GeneralFailure | IOC General Failure

16.3.2 Methods

Among the services that a TCA provides to an initiating client is a
mechanism to deliver detailed information about the I/O resources (e.g. IOCs)
supported by the IOU. This information transcends a simple count of the
number of IOCs supported to provide details of each IOC such as a GUID,
a vendor-unique ID, product revision levels, and other information that is
specific to a given IOC. The purpose of the detailed information is to let a
system configuration manager allocate the IOU's resources to various
clients located on the IBA fabric, and to provide a common way for host
resource managers to determine the characteristics of IOUs and IOCs. This
allows the proper driver to be associated with each controller.

The profiles are requested and returned through the GSI, which is an
unreliable datagram service. The actual access QP and DLID may be
redirected by the GSI. The IOUnitInfo attribute contains information on the
number of IOCs the unit can support (IOUnitInfo:MaxControllers). This
value is the length of the IOUnitInfo:ControllerList, which has an entry for
every possible controller "slot" (which may be physical or logical). Each
entry in the ControllerList component shows whether a controller is
present. For each controller, the IOControllerProfile attribute contains
information such as the type of controller and the number of connections the
IOC can support. Each controller has a ServiceEntries attribute associated
with it. ServiceEntries is a table of ServiceIDs that the controller
advertises to its clients. The formats of the IOUnitInfo, IOControllerProfile and
ServiceEntries structures are defined in 16.3.3 Attributes on page 765.



Table 215 Device Management Methods

Method Type | Value | Description
DevMgtGet() | 0x01 | Request an IOU to return (read) Device Management class attributes such as profile or a list of controllers currently installed.
DevMgtSet() | 0x02 | Request an IOU to set (write) an attribute. The object will issue a DevMgtGetResp() as a response.
DevMgtGetResp() | 0x81 | IOU responds to an attribute Get or Set request.
DevMgtTrap() | 0x05 | Unsolicited datagram sent to the Device Management entity. Contains the Notice Attribute as defined in 13.4.8.2 Notice on page 591 to identify the trap.
DevMgtTrapRepress | 0x07 | Block repetition of notification.
16.3.3 Attributes



This section specifies the format of the attributes used for managing the
IOU. Messages used as part of I/O transactions are not specified in this
document. The term "device" refers to actual devices sitting behind IOCs.
The way they are numbered is implementation-specific.

Table 216 Device Management Attributes

Attribute Name | Attribute ID | Attribute Modifier | Description
ClassPortInfo | 0x0001 | 0x0000_0000 | See 13.4.8.1 ClassPortInfo on page 589
Notice | 0x0002 | 0x0000_0000-0xFFFF_FFFF | See 13.4.8.2 Notice on page 591
IOUnitInfo | 0x0010 | 0x0000_0000 | List of all IOCs present in a given IOU. Each IOU may support up to 0xFF controllers.
IOControllerProfile | 0x0011 | 0x0000_0001-0x0000_00FF | IOC Profile Information. Attribute Modifier identifies the IOC.
ServiceEntries | 0x0012 | 0x0001_0000-0x00FF_FFFF | List of supported services and their associated Service IDs. Each IOC has a table with at most 0x100 ServiceEntries. The attribute modifier is structured as follows: the upper 16 bits identify the IOC; the lower 16 bits specify a range of up to four Service Entries to be retrieved. The first 8 bits of the lower 16 bits specify the beginning and the last 8 bits the end of the range.
Reserved | 0x0013-0x001F | 0x0000_0000-0xFFFF_FFFF | Reserved
DiagnosticTimeout | 0x0020 | 0x0000_0001-0xFFFF_FFFF | Response indicates maximum time for completion of diagnostic test. Target device is identified by the Attribute Modifier. Tests not completing within this period may indicate device failure. Specified in multiples of milliseconds.
PrepareToTest | 0x0021 | 0x0000_0001-0xFFFF_FFFF | A Set with this Attribute instructs the device specified by the Attribute Modifier to prepare for diagnostic test. A Get of this Attribute will result in the appropriate Response Status being set as follows: 0x0000 = Ready for diagnostic test; 0x0100 = Invalid Attribute Modifier; 0x0200 = Device not ready; 0x0400 = Device not responding; 0x0800 = Diagnostics not supported; 0x1000 - 0x8000 = Reserved
TestDeviceOnce | 0x0022 | 0x0000_0001-0xFFFF_FFFF | A Set instructs the device specified by the Attribute Modifier to initiate a single diagnostic test and run it once. Vendor-unique attribute values may be defined to initiate specific test instructions.
TestDeviceLoop | 0x0023 | 0x0000_0001-0xFFFF_FFFF | A Set instructs the device specified by the Attribute Modifier to initiate a single diagnostic test and run it continuously in a loop. Vendor-unique attribute values may be defined to initiate specific test instructions.
DiagCode | 0x0024 | 0x0000_0001-0xFFFF_FFFF | Vendor-specific Diagnostic information for the device specified by the Attribute Modifier. See 14.2.5.6.1 Interpretation of DiagCode on page 640.
Reserved | 0x0025-0xFEFF | 0x0000_0000-0xFFFF_FFFF | Reserved
Vendor specific | 0xFF00-0xFFFF | 0x0000_0000-0xFFFF_FFFF | Vendor-unique attribute values may be defined to initiate specific test instructions.
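
The ServiceEntries attribute modifier described in Table 216 can be illustrated with a short encoder and decoder. The table does not say which byte of the lower 16 bits counts as "first"; this sketch assumes the beginning of the range sits in the low-order byte (bits 7-0) and the end in the next byte (bits 15-8). That bit placement, like the function names, is an assumption made for illustration only.

```python
def service_entries_modifier(ioc, first, last):
    """Build the 32-bit Attribute Modifier for a ServiceEntries request.

    Assumption: `first` occupies bits 7-0 and `last` bits 15-8.
    """
    if not 1 <= ioc <= 0xFF:
        raise ValueError("IOC must be in the range 1..0xFF")
    if not 0 <= first <= last <= 0xFF:
        raise ValueError("range must satisfy 0 <= first <= last <= 0xFF")
    if last - first > 3:
        raise ValueError("at most four Service Entries per request")
    return (ioc << 16) | (last << 8) | first


def parse_service_entries_modifier(modifier):
    """Return (ioc, first, last) under the same bit-placement assumption."""
    return (modifier >> 16) & 0xFFFF, modifier & 0xFF, (modifier >> 8) & 0xFF
```

With this layout the smallest and largest values the encoder produces, 0x0001_0000 and 0x00FF_FFFF, coincide with the modifier range listed for ServiceEntries in Table 216.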


Table 217 Device Management Attribute / Method Map

Attribute Name | DevMgtGet | DevMgtSet | DevMgtTrap
ClassPortInfo | X | X | -
Notice | X | X | X
IOUnitInfo | X | - | -
IOControllerProfile | X | - | -
ServiceEntries | X | - | -
DiagnosticTimeout | X | - | -
PrepareToTest | X | X | -
TestDeviceOnce | - | X | -
TestDeviceLoop | - | X | -
DiagCode | X | - | -

16.3.3.1 ClassPortInfo







The ClassPortInfo attribute is described in 13.4.8.1 ClassPortInfo on page 
589 . No class-specific bits are defined. 

Table 218 Device Management ClassPortInfo:CapabilityMask

Bits | Name | Meaning
0-7 | - | Common bits as defined in 13.4.8.1 ClassPortInfo on page 589
8-15 | - | Class-specific bits are reserved



16.3.3.2 Notice 



The Notice attribute is described in 13.4.8.2 Notice on page 591. It is used
for one optional generic trap.



Table 219 Device Management Traps

Name | Type | Number | DataDetails
ReadyToTest | Informational | 514 | Device <DEVICE> readiness is <STATUS> where status is the same as would have been returned by a Get(PrepareToTest) with device as the attribute modifier.



o16-5: Device Management Traps use the following layout for the
DataDetails component of the Notice attribute, see Table 220 Notice
DataDetails For Trap 514 on page 767. Fields shall be filled with the information
corresponding to the description of a given trap.



Table 220 Notice DataDetails For Trap 514

Field | Length(bits) | Description
STATUS | 16 | Readiness status
DEVICE | 32 | Device number
Padding | 384 | Shall be ignored on read. Content is unspecified.



16.3.3.3 IOUnitInfo

Table 221 IOUnitInfo

Component | Access | Length(bits) | Description
Change_ID | RO | 16 | Incremented, with rollover, by any change to ControllerList.
Max Controllers | RO | 8 | Number of slots in ControllerList.
Reserved | RO | 7 | Reserved for future use.
Option ROM | RO | 1 | Indicates presence of Option ROM. 1 = Present; 0 = Absent.
ControllerList | RO | 1024 | A series of 4-bit nibbles with each representing a slot in the IOU. Each 4-bit nibble can take the following values: 0x0 = IOC not installed; 0x1 = IOC present; 0x2-0xe = reserved; 0xf = slot does not exist. Bits 7-4 of the first byte (lowest offset) represent slot 0, bits 3-0 represent slot 1, bits 7-4 of the second byte represent slot 2, bits 3-0 represent slot 3, and so on.
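
The nibble layout of ControllerList (high nibble for even slots, low nibble for odd slots) can be decoded as in the sketch below. The function name and the returned strings are hypothetical; the nibble values and slot ordering are those given in Table 221.

```python
SLOT_STATES = {
    0x0: "IOC not installed",
    0x1: "IOC present",
    0xF: "slot does not exist",
}


def decode_controller_list(controller_list, max_controllers):
    """Return one state string per slot, for slots 0..max_controllers-1."""
    states = []
    for slot in range(max_controllers):
        byte = controller_list[slot // 2]
        # Even slots use bits 7-4 of the byte, odd slots use bits 3-0.
        nibble = byte >> 4 if slot % 2 == 0 else byte & 0x0F
        states.append(SLOT_STATES.get(nibble, "reserved"))
    return states
```

For example, a ControllerList beginning with the bytes 0x10 0xF2 describes slot 0 as present, slot 1 as not installed, slot 2 as nonexistent, and slot 3 as holding a reserved value.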



16.3.3.4 IOControllerProfile 



Table 222 IOControllerProfile

Component | Access | Length(bits) | Description
GUID | RO | 64 | An EUI-64 GUID used to uniquely identify the controller. This could be the same one as the Node/Port GUID if there is only one controller.
VendorID | RO | 24 | I/O controller vendor ID, IEEE format
Reserved | RO | 8 | Reserved for proper alignment.
DeviceID | RO | 32 | A number assigned by the vendor to identify the type of controller. This can be used by an Operating System to select a device driver.
Device Version | RO | 16 | A number assigned by the vendor to identify the device version.
Reserved | RO | 16 | Reserved for proper alignment.
Subsystem VendorID | RO | 24 | ID of the vendor of the enclosure, if any, in which the I/O controller resides in IEEE format; otherwise zero.
Reserved | RO | 8 | Reserved for proper alignment.
SubsystemID | RO | 32 | A number identifying the subsystem where the controller resides.
IO Class | RO | 16 | 0x0000-0xfffe = Reserved pending I/O class specification approval. 0xffff = Vendor-specific
IO Subclass | RO | 16 | 0x0000-0xfffe = Reserved pending I/O subclass specification approval. 0xffff = Vendor-specific. This must be set to 0xffff if the I/O Class component is set to 0xffff.
Protocol | RO | 16 | 0x0000-0xfffe = Reserved pending I/O protocol specification approval. 0xffff = Vendor-specific. This must be set to 0xffff if the I/O Class component is set to 0xffff.
Protocol Version | RO | 16 | Protocol specific.
Service Connections | RO | 16 | Number of service connections controller can support.
Initiators Supported | RO | 16 | Number of initiators that this IOC can support.
Send Message Depth | RO | 16 | Maximum Depth of the Send Message Queue.
RDMA Read Depth | RO | 16 | Maximum Depth of the per-channel RDMA Read Queue.
Send Message Size | RO | 32 | Maximum size of Send Messages in bytes.
RDMA Transfer Size | RO | 32 | Maximum size of outbound RDMA transfers initiated by the IOC - in bytes.
Controller Operations Capability Mask | RO | 8 | Supported operation types of this I/O controller. A bit set to 1 for affirmation of supported capability. Bit: Name; Description. 0: ST; Send Messages To IOCs. 1: SF; Send Messages From IOCs. 2: RT; RDMA Read Requests To IOCs. 3: RF; RDMA Read Requests From IOCs. 4: WT; RDMA Write Requests To IOCs. 5: WF; RDMA Write Requests From IOCs. 6: AT; Atomic Operations To IOCs. 7: AF; Atomic Operations From IOCs.
Controller Services Capability Mask | RO | 8 | Supported operation types of this I/O controller. A bit set to 1 for affirmation of supported capability. Bit: Name; Description. 0: CS; Console Services. 1: SBWP; Storage Boot Wire Protocol. 2: NBWP; Network Boot Wire Protocol. 3-7: Reserved, For future services.
Service Entries | RO | 8 | Number of entries in the ServiceEntries table.
Reserved | RO | 72 | Reserved for future use.
ID String | RO | 512 | An ASCII text string for identifying the controller to operator.
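
The Controller Operations Capability Mask in Table 222 is an 8-bit field in which each set bit affirms one capability. A minimal decoding sketch follows. It assumes bit 0 is the least significant bit of the byte; the function name is hypothetical.

```python
# Bit names and descriptions follow the Controller Operations Capability
# Mask row of Table 222. Assumption: bit 0 is the least significant bit.
OPERATION_BITS = [
    ("ST", "Send Messages To IOCs"),
    ("SF", "Send Messages From IOCs"),
    ("RT", "RDMA Read Requests To IOCs"),
    ("RF", "RDMA Read Requests From IOCs"),
    ("WT", "RDMA Write Requests To IOCs"),
    ("WF", "RDMA Write Requests From IOCs"),
    ("AT", "Atomic Operations To IOCs"),
    ("AF", "Atomic Operations From IOCs"),
]


def decode_operations_mask(mask):
    """Return the short names of all capabilities whose bit is set."""
    if not 0 <= mask <= 0xFF:
        raise ValueError("mask must fit in 8 bits")
    return [name for bit, (name, _desc) in enumerate(OPERATION_BITS)
            if mask & (1 << bit)]
```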
16.3.3.5 ServiceEntries

Table 223 ServiceEntries

Component | Access | Length(bits) | Description
ServiceName_1 | RO | 320 | String of Service name in text format.
ServiceID_1 | RO | 64 | An identifier of the associated Service.
ServiceName_2 | RO | 320 | String of Service name in text format.
ServiceID_2 | RO | 64 | An identifier of the associated Service.
ServiceName_3 | RO | 320 | String of Service name in text format.
ServiceID_3 | RO | 64 | An identifier of the associated Service.
ServiceName_4 | RO | 320 | String of Service name in text format.
ServiceID_4 | RO | 64 | An identifier of the associated Service.



16.3.3.6 DiagnosticTimeout

Table 224 DiagnosticTimeout

Component | Access | Length(bits) | Description
MaxDiagTime | RO | 32 | Maximum time to finish a diagnostic operation in milliseconds

16.3.3.7 PrepareToTest

Table 225 PrepareToTest

Component | Access | Length(bits) | Description
This attribute does not have any components

16.3.3.8 TestDeviceOnce

Table 226 TestDeviceOnce

Component | Access | Length(bits) | Description
This attribute does not have any components

16.3.3.9 TestDeviceLoop

Table 227 TestDeviceLoop

Component | Access | Length(bits) | Description
This attribute does not have any components

16.3.3.10 DiagCode

Table 228 DiagCode

Component | Access | Length(bits) | Description
Vendor-specific

16.3.4 Device Diagnostic Framework

Device Diagnostics allows the identification of faults in devices behind the
target channel adapter. As such, it complements other sections of this
specification that describe how problems at the fabric and node level may
be identified and isolated.

The device diagnostic framework is intended to support tests within an
active fabric. It is versatile enough, however, to accommodate vendor-unique
approaches that may include retrieval of power-on data. It should
be noted that some, and perhaps most, devices may not permit simultaneous
use of I/O transaction messages and diagnostics. Unless data is
flushed from internal buffers, for example, corruption or loss of user data
might occur. Further, it is expected that the diagnostics tests would require
setting the device to an initial, known state. For that reason, provision will
be made to put the device into a "ready" state prior to test, which will likely
cause I/O transactions to be held off. This may, in turn, cause established
connections to time out, and other management notices to be sent.

In general, device diagnostics should be used with great care, and with
full understanding of the potential impact to I/O transactions to the target
device. It is best used during periods of initial configuration, major
maintenance, or as a tool of last resort.

16.3.4.1 Behaviors
The Device Management class of MADs (see 16.3 Device Management
on page 761) is used for diagnostics. Within that class, standard methods
as defined in Table 215 Device Management Methods on page 764 are
utilized. Attributes specific to device diagnostics are defined by which
vendor-supplied tests may be invoked, and the results of completed tests
then determined. Attribute Modifiers are used to indicate the Port Number
of the device under test. Results are reported in a format consistent with
the 16-bit DiagCode used for the Node Diagnostics (see 14.2.5.6.1
Interpretation of DiagCode on page 640).

The PrepareToTest attribute within the DevMgtSet method places the
device into a test-ready state. The time required to complete this step is not
predictable, as it may involve flushing data from cache memory,
reinitializing SCSI ports, etc. The device indicates its readiness for test by
signaling the IOU to send an Informational Trap.

Alternatively, a Get method on this attribute will return information
pertaining to the specific device's ability or readiness for test. This allows the
status to be polled on a periodic basis, or to determine that the device
does not support diagnostic tests.

Two modes are provided for initiating diagnostics: single test mode and
continuous test mode. In single test mode, a single test sequence is
initiated by setting a non-zero value for the TestDevice attribute, with MSB=0.
Once initiated, this vendor-defined test will run to completion. Because
tests will vary by device technology and by vendor, the time-to-completion
is inherently unpredictable. To detect errant devices which are unable to
complete their diagnostic test, a DiagnosticTimeout attribute may be
retrieved in advance of test initiation, which indicates the maximum
allowable period for completion. Results of the completed diagnostic test are
obtained through the DiagCode attribute of the GetResp method.

The continuous test mode can assist in detecting problems that are
transient in nature, or be used to initiate endurance-related tests. The
continuous-test mode is initiated by setting a non-zero value for the TestDevice
attribute, with MSB=1. Results of the last completed diagnostic test are
obtained through the DiagCode attribute of the GetResp method.

Interpretation of results obtained through the GetResp method is
vendor-specific.

It is beyond the scope of this specification to define a set of white-box,
technology-specific diagnostic tests. Rather, the intent is to allow initiation
of a vendor-supplied test sequence, for which the expected outcome
would be either success or failure. The DiagCode format, however, allows
flexibility for the vendor to provide specific, coded information about the
test results.

16.4 SNMP Tunneling

The SNMP Tunneling Agent is optional.






InfiniBand™ Trade Association



Page 772

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067



InfiniBand™ Architecture Release 1.0 
Volume 1 - General Specifications 



General Services 



October 24, 2000 
FINAL 



This section describes the Management Datagrams used to report native
SNMP tunneling over the IBA.

SNMP, or Simple Network Management Protocol, consists of a set of
standards for network management, a protocol, and a database specification
to uniformly address managed information objects structured in a format
called MIB-II (Management Information Base version 2). SNMP was
originally specified in RFC1157, and later, RFC1902 for SNMP v2.

The structure of management information was originally laid out in
RFC1155; the current MIB-II standard resides in RFC1213. The RFCs that
this document references are RFC1902-1908 for SNMPv2 (although
SNMPv2 supports SNMPv1) and RFC1213 for the MIB-II standard.

SNMP tunneling is a supported option to the InfiniBand architecture as a
Management Datagram service. Devices advertise support for the SNMP
tunneling service by use of the IsSNMPSupported Capability in the PortInfo
Attribute. If the value is non-zero, a given device may be queried via the
GSI for the QP and LID to access the SNMP service. Note that this
capability allows for another port to supply SNMP tunneling services by proxy.

This section describes the required class-dependent behavior of the
datagrams in this class.

16.4.1 MAD Format

o16-6: The datagrams in the SNMP tunneling class shall conform to the
MAD format and use as specified in 13.4 Management Datagrams on
page 573 and further customized in Figure 173 SNMP Tunneling MAD
Format on page 773 and Table 229 SNMP Tunneling MAD Fields on page
774 below.

Figure 173 SNMP Tunneling MAD Format

bytes 64-255   Data

Table 229 SNMP Tunneling MAD Fields

Field Name         Length     Description
-----------------  ---------  ------------------------------------------
Common MAD Header  24 bytes   Common MAD Header as described in 13.4.2
                              Management Datagram Format on page 574.
Reserved           32 bytes   Set to all 0.
Raddress           4 bytes    Opaque address field that is used by SNMP
                              agent to forward SNMP packets using SNMP
                              redirect features.
Payload Length     1 byte     Number of valid data bytes in entire SNMP
                              packet being transferred.
Segment Number     1 byte     Segment number of a segmented SNMP packet.
Source LID         2 bytes    Local address of the SNMP packet sender.
Data               192 bytes  Attribute data is mapped bit for bit from
                              the format described in the following
                              sections to the start of this data field.
                              If the attribute is smaller than the data
                              field, the content of the remainder of the
                              data field is unspecified.
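The field lengths in Table 229 add up to a 256-byte datagram
(24 + 32 + 4 + 1 + 1 + 2 + 192). The following is a minimal sketch of
that layout, not a reference encoder. The function names, big-endian
byte order, and zero padding of the unused Data bytes are assumptions of
this sketch; the 24-byte common header is treated as an opaque blob
because its layout is defined in 13.4.2, not here.

```python
import struct

COMMON_HEADER_LEN = 24  # Common MAD Header, opaque in this sketch
DATA_LEN = 192          # Data field length from Table 229

# Reserved(32) Raddress(4) PayloadLength(1) SegmentNumber(1)
# SourceLID(2) Data(192); big-endian is an assumption.
_BODY = struct.Struct(">32s4sBBH192s")


def pack_snmp_mad(header, raddress, payload_length, segment_number,
                  source_lid, data):
    """Assemble one 256-byte SNMP tunneling MAD from its fields."""
    if len(header) != COMMON_HEADER_LEN:
        raise ValueError("common MAD header must be 24 bytes")
    if len(raddress) != 4:
        raise ValueError("Raddress must be 4 bytes")
    if len(data) > DATA_LEN:
        raise ValueError("data exceeds the 192-byte Data field")
    # struct pads a short 's' field with zero bytes; the specification
    # leaves the remainder of the Data field unspecified.
    return header + _BODY.pack(bytes(32), raddress, payload_length,
                               segment_number, source_lid, data)


def unpack_snmp_mad(mad):
    """Split a 256-byte SNMP tunneling MAD back into its fields."""
    if len(mad) != COMMON_HEADER_LEN + _BODY.size:
        raise ValueError("SNMP tunneling MAD must be 256 bytes")
    _reserved, raddress, payload_length, segment_number, source_lid, data = \
        _BODY.unpack(mad[COMMON_HEADER_LEN:])
    return {
        "header": mad[:COMMON_HEADER_LEN],
        "raddress": raddress,
        "payload_length": payload_length,
        "segment_number": segment_number,
        "source_lid": source_lid,
        "data": data,
    }
```

A receiver would normally use the Payload Length value to decide how
many bytes of the Data field are meaningful.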



16.4.1.1 Status Field

The Status field is described in 13.4.7 Status Field on page 587. No
class-specific bits are defined.

Table 230 SNMP Tunneling Status Field

Bits   Name   Meaning
-----  -----  ------------------------------------------------------
0-7           Common bits as defined in 13.4.7 Status Field on
              page 587
8-15          Class-specific bits are reserved



16.4.2 Methods 



This class utilizes the common methods described in 13.4.5 Management
Class Methods on page 577.

Table 231 SNMP Tunneling Methods

Method Type    Value  Description
-------------  -----  ---------------------------------------
SnmpGet()      0x01   Request a get (read) of an Attribute.
SnmpSet()      0x02   Request a set (write) of an Attribute.
SnmpGetResp()  0x81   Response from a get or set request.
SnmpSend()     0x03   Send an Attribute to a node.



16.4.3 Attributes 



Table 232 SNMP Tunneling Attributes

Attribute Name  ID      Attribute    Description                     Length
                        Modifier
--------------  ------  -----------  ------------------------------  ---------
ClassPortInfo   0x0001  0x0000_0000  See 13.4.8.1 ClassPortInfo on
                                     page 589
CommunityInfo   0x0010  0x0000_0000  Community Name Data Store       64 bytes
                        0x0000_0001
                        0x0000_0002
                        0x0000_0003
PduInfo         0x0011  0x0000_0001  First SNMP segment              192 bytes
                        0x0000_0002  Intermediate SNMP segment
                        0x0000_0003  Last SNMP segment
                        0x0000_0004  First and Last SNMP segment
                        0x8000_0001  First redirected SNMP segment
                        0x8000_0002  Intermediate redirected SNMP
                                     segment
                        0x8000_0003  Last redirected SNMP segment
                        0x8000_0004  First and Last redirected SNMP
                                     segment
Table 233 SNMP Tunneling Attribute / Method Map

Attribute Name  SnmpGet  SnmpSet  SnmpSend
--------------  -------  -------  --------
ClassPortInfo   X        X
CommunityInfo                     X
PduInfo                           X
16.4.3.1 ClassPortInfo

The ClassPortInfo attribute is described in 13.4.8.1 ClassPortInfo on page
589. No class-specific bits are defined.

Table 234 SNMP Tunneling ClassPortInfo:CapabilityMask

Bits   Name   Meaning
-----  -----  ------------------------------------------------------
0-7           Common bits as defined in 13.4.8.1 ClassPortInfo on
              page 589
8-15          Class-specific bits are reserved



16.4.3.2 CommunityInfo

Table 235 CommunityInfo

Component       Settability  Length(bits)  Description
--------------  -----------  ------------  ---------------------------------
Community Name  RW           512           Community Name, used for
                                           authentication. The SNMP standard
                                           specifies a 255 byte community
                                           name field in the SNMP packet.
                                           This field is used for
                                           authentication by the SNMP
                                           protocol. This Attribute stores
                                           the community name in four 64
                                           byte segments. The SNMP agent
                                           residing in the Endnode
                                           concatenates the four segments in
                                           order 0-3 to form the 255 byte
                                           string.
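The four-segment storage described for CommunityInfo can be sketched as
follows. This is an illustration only: the helper names are
hypothetical, and zero padding of unused bytes (with the padding
stripped again on reassembly) is an assumption, since the text above
does not say how a name shorter than 255 bytes is terminated. The
dictionary keys correspond to the Attribute Modifier values 0 through 3
listed for CommunityInfo in Table 232.

```python
SEGMENT_LEN = 64    # each CommunityInfo attribute holds 512 bits
SEGMENTS = 4        # attribute modifiers 0x0000_0000 .. 0x0000_0003
MAX_NAME_LEN = 255  # community name length quoted in Table 235


def split_community_name(name: bytes) -> dict:
    """Map attribute modifier (0-3) to its 64-byte slice of the name."""
    if len(name) > MAX_NAME_LEN:
        raise ValueError("community name longer than 255 bytes")
    # Zero padding is an assumption of this sketch.
    padded = name.ljust(SEGMENT_LEN * SEGMENTS, b"\x00")
    return {m: padded[m * SEGMENT_LEN:(m + 1) * SEGMENT_LEN]
            for m in range(SEGMENTS)}


def join_community_name(segments: dict) -> bytes:
    """Concatenate segments in order 0-3 and recover the name."""
    joined = b"".join(segments[m] for m in range(SEGMENTS))
    return joined[:MAX_NAME_LEN].rstrip(b"\x00")
```

Under these assumptions a name that itself ends in a zero byte would not
round-trip; a real agent would need an explicit length or terminator
rule.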



16.4.3.3 PduInfo

Table 236 PduInfo

Component  Settability  Length(bits)  Description
---------  -----------  ------------  -----------------
PduData    RW           1536          SNMP data segment
16.4.4 Operations



Figure 174 depicts the SNMP PDU format (shown in network byte order, 
as published in IETF publications). This is abstracted on top of the MAD 
datagram. 



[Figure: SNMP PDU layouts.
  SNMP Message: Version | Community | SNMP PDU
  GetRequest, GetNextRequest, SetRequest PDUs:
      PDU Type | request id | 0 | 0 | variable bindings
  GetResp PDU:
      PDU Type | request id | error status | error index |
      variable bindings
  Trap PDU:
      PDU Type | enterprise | agent addr | generic trap |
      specific trap | time stamp | variable bindings
  Variable Bindings for any PDU:
      name 1 | value 1 | name 2 | value 2 | ... | name N | value N]

Figure 174 SNMP PDU Format



The maximum object definable in a MIB is 256 bytes, but the maximum
payload available in an SNMP datagram is 192 bytes. Because SNMP
cannot be redefined to suit the MAD datagram, the InfiniBand SNMP
transport provides segmentation/reassembly.

A packet consists of one or more segments. If necessary, a packet will be
segmented at the source, transmitted, and reassembled at the target.
Using the SNMP datagram, the source specifies the SnmpSend method and the
PduInfo Attribute, then sets the Attribute modifier, segment number (if
the packet is segmented), and the payload length fields to delimit and
account for segments.

o16-7: When SNMP packets must be segmented into multiple MADs, the
data field of all but the last MAD transferred shall be completely filled
(192 bytes of data).
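The segmentation rule can be sketched as below: every segment except
the last carries exactly 192 bytes, and the PduInfo Attribute Modifier
marks each segment as first, intermediate, last, or first and last,
using the values from Table 232. The function name is hypothetical, and
numbering segments from zero is an assumption of this sketch; the text
does not state the starting value of the Segment Number field.

```python
DATA_LEN = 192  # bytes of Data per SNMP tunneling MAD

# PduInfo Attribute Modifier values (Table 232)
FIRST, INTERMEDIATE, LAST, FIRST_AND_LAST = 1, 2, 3, 4
REDIRECT = 0x8000_0000  # set for the redirected variants


def segment_packet(packet: bytes, redirect: bool = False):
    """Return a list of (attribute_modifier, segment_number, chunk)."""
    if not packet:
        raise ValueError("empty packet")
    chunks = [packet[i:i + DATA_LEN]
              for i in range(0, len(packet), DATA_LEN)]
    flag = REDIRECT if redirect else 0
    segments = []
    for number, chunk in enumerate(chunks):
        if len(chunks) == 1:
            kind = FIRST_AND_LAST
        elif number == 0:
            kind = FIRST
        elif number == len(chunks) - 1:
            kind = LAST
        else:
            kind = INTERMEDIATE
        # Segment numbers start at 0 here by assumption.
        segments.append((flag | kind, number, chunk))
    return segments
```

For a 400-byte packet this yields three segments of 192, 192 and 16
bytes, so only the final MAD is partially filled.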
o16-8: If the second or subsequent MAD is not received within a timeout
of ClassPortInfo:ResponseTimeValue of the sender, all received MADs
for an SNMP packet shall be discarded and any additional MADs received
shall also be discarded.

o16-9: The Transaction ID field of all the MADs of the SNMP packet shall
be the same, and shall conform to the uniqueness of Transaction IDs as
described in 13.4.6.4 TransactionID usage on page 586.

Because the destination is already known by the sender of the MAD
packet, there is no need to include it in the MAD packet. However,
because the sender may be expecting a response from the agent receiving
this SNMP request, the original source LID is provided so the agent knows
where to send a reply.

The maximum number of bytes transmittable in a single UDP Packet is
slightly over 8192. The SNMP management class can transfer a packet of
up to 49,152 bytes. This is enough to accommodate any incoming
SNMP/UDP packet and allow for flexibility if management packets arrive
from a transport other than TCP/UDP.
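The 49,152-byte figure is consistent with a one-byte Segment Number
field addressing 256 segments of 192 bytes each. The text does not state
this derivation, so the short calculation below is one reading, not the
specification's own reasoning; the function name is hypothetical.

```python
import math

DATA_LEN = 192           # Data bytes per MAD (Table 229)
SEGMENT_NUMBER_BITS = 8  # Segment Number field is 1 byte (Table 229)

max_segments = 2 ** SEGMENT_NUMBER_BITS
max_packet = max_segments * DATA_LEN  # 256 * 192


def segments_needed(packet_len: int) -> int:
    """Number of MADs needed to carry a packet of the given length."""
    return math.ceil(packet_len / DATA_LEN)
```

By this count an 8192-byte UDP payload would need 43 MADs, well inside
the 256-segment limit.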

o16-10: When the pieces are reassembled, the SNMP Message shall be
extracted and passed up to the agent or manager for processing.

If a MAD is marked First and Last with the attribute modifier, it is the
only segment in the packet. No segmentation has occurred, so no
reassembly is required; the message is extracted and passed up.

If SNMP Redirect is specified in the attribute modifier, the packet is
meant for a target managed by the proxy agent processing the packet. The
proxy agent will need to parse the packet to extract the Raddress value
of the final destination to reformat the PDU for further transport along
the new interconnect.
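The receive side of the rules above can be sketched as a small state
holder for one SNMP packet (one Transaction ID). This is a simplified
illustration under stated assumptions: the class and method names are
hypothetical, segments are assumed to arrive in order (the Segment
Number field is not checked), and the timeout is passed in as seconds
because the encoding of ClassPortInfo:ResponseTimeValue is defined
elsewhere in the specification.

```python
FIRST, INTERMEDIATE, LAST, FIRST_AND_LAST = 1, 2, 3, 4
KIND_MASK = 0x7FFF_FFFF  # strips the redirect bit (0x8000_0000)


class Reassembler:
    """Collects the segments of a single SNMP packet."""

    def __init__(self, timeout: float):
        self.timeout = timeout  # stands in for ResponseTimeValue
        self.chunks = []
        self.last_arrival = None
        self.dead = False       # True once the packet was discarded

    def receive(self, modifier: int, chunk: bytes, now: float):
        """Return the whole packet when complete, otherwise None."""
        kind = modifier & KIND_MASK
        if kind == FIRST_AND_LAST:
            return chunk  # unsegmented: nothing to reassemble
        if kind == FIRST:
            self.chunks, self.dead = [chunk], False
            self.last_arrival = now
            return None
        if self.dead or self.last_arrival is None:
            return None  # discard MADs of an abandoned packet
        if now - self.last_arrival > self.timeout:
            self.chunks, self.dead = [], True  # late: discard all
            return None
        self.chunks.append(chunk)
        self.last_arrival = now
        if kind == LAST:
            packet = b"".join(self.chunks)
            self.chunks, self.last_arrival = [], None
            return packet
        return None
```

A proxy agent would additionally test the redirect bit of the modifier
and read the Raddress field before forwarding the reassembled packet.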
16.4.4.1 SNMP Targets for Beyond the InfiniBand Endnode

[Figure: a Management Station running SNMP on a Host Channel Adaptor,
and a Target Channel Adaptor with an SNMP Proxy Agent that reaches
Managed Objects (identified by MAC ID) on Legacy Subnet 1 and Legacy
Subnet 2.]

Figure 175 SNMP Proxy Agents



Targeting a remote managed object that is not directly connected to the
InfiniBand fabric requires the use of an SNMP Proxy Agent. See Figure 175.
The basic function of a Proxy Agent is to receive SNMP packets passed
up from the InfiniBand Endnode SNMP agent and forward them to that
remote managed agent. These remote agents are not directly connected to
the InfiniBand fabric and thus cannot be managed through it unless an
intermediate device acts on their behalf to receive and send over the
unsupported interconnect.
o16-11: The InfiniBand architecture shall be able to accommodate such
legacy transports by redirecting SNMP packets destined for these
managed nodes.

[Figure: the Host Management Station splits the Original SNMP Packet
into Payload segments (possibly fragmented for transport through the
fabric). The Target Channel Adaptor with SNMP Proxy Agent reassembles
the SNMP packet, then directs the reassembled packet through the
specified subnet to the specified MAC ID using the legacy protocol's
transport mechanism, to reach the Managed Object.]

Figure 176 SNMP PDU Segmentation

SNMP targeting for beyond the InfiniBand Endnode (such as an InfiniBand
device attached to a TCA that supports SNMP) is accomplished by
an SNMP redirect. An SNMP packet destined for such a redirection will
contain one of the SNMP redirect features and specify the destination
address in the Raddress field of the SNMP class datagram. This will allow
the Proxy SNMP Agent to reassemble the SNMP packet from its fragments (if
any) so it may re-encode the packet for the legacy transport over which it
travels to reach its final destination.

16.4.4.2 Trap Event Subscription 

A node may request that SNMP traps from a given node be sent to it. This
is done by setting the ClassPortInfo Attribute with the LID and QP
appropriately. The SNMP agent will transmit the Trap PDU as a sequence of
SNMP datagrams to the destination node; that is, it does not use the
method Trap() but the method Send() with the Trap PDU as the SNMP Data.

16.5 Vendor-specific

The Vendor-specific Agent is optional.

Vendor-specific operations can be defined using the vendor-specific
management classes.

Vendors are free to define new methods and attributes and their use,
provided that they conform to the common MAD format and methods defined
herein, and do not conflict with the stated restrictions on method and
attribute utilization.

16.5.1 MAD Format

o16-12: The datagrams in these Vendor-specific classes shall conform to
the MAD format and use as specified in 13.4 Management Datagrams on
page 573 and further customized in Figure 177 Vendor MAD Format on
page 781 and Table 237 Vendor MAD Fields on page 781 below.

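Figure 177 Vendor MAD Format

[Figure: byte layout of the Vendor MAD. The Common MAD Header occupies
bytes 0 through 23 and the Data field occupies bytes 24 through 255.]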



Table 237 Vendor MAD Fields

Field Name         Length     Description
-----------------  ---------  ------------------------------------------
Common MAD Header  24 bytes   Common MAD Header as described in 13.4.2
                              Management Datagram Format on page 574.
Data               232 bytes  Attribute data is mapped bit for bit from
                              the format described in the following
                              sections to the start of this data field.
                              If the attribute is smaller than the data
                              field, the content of the remainder of the
                              data field is unspecified.
16.5.1.1 Status Field

The Status field is described in 13.4.7 Status Field on page 587.
Class-specific bits are defined by the Vendor.

Table 238 Vendor Status Field

Bits   Name   Meaning
-----  -----  ------------------------------------------------------
0-7           Common bits as defined in 13.4.7 Status Field on
              page 587
8-15          Class-specific bits defined by Vendor

16.5.2 Methods

Vendor classes support the common methods.

Table 239 Vendor Class Methods

Method Type          Value  Description
-------------------  -----  ------------------------------------------
VendorGet()          0x01   Request an attribute to return (read) from
                            a target.
VendorSet()          0x02   Request a target to set (write) an
                            attribute. The object will issue a
                            VendorGetResp() as a response.
VendorGetResp()      0x81   Target responds to an attribute Get/Set
                            request.
VendorSend()         0x03   Send a datagram. Does not require a
                            response.
VendorTrap()         0x05   Unsolicited datagram sent to the vendor
                            entity. Contains the Notice Attribute as
                            defined in 13.4.8.2 Notice on page 591 to
                            identify the trap.
VendorTrapRepress()  0x07   Block repetition of notification.

Vendor-specific methods can be added as desired by vendor, providing
there is no collision with reserved methods in 13.4.5 Management Class
Methods on page 577.

16.5.3 Attributes

o16-13: The Vendor classes shall support the ClassPortInfo attribute.

The Vendor classes may optionally support the attributes Notice and
InformInfo. All other attributes are vendor-defined.

Table 240 Vendor Class Attributes

Attribute Name  Attribute ID     Attribute Modifier       Description
--------------  ---------------  -----------------------  ----------------
ClassPortInfo   0x0001           0x00000000               See 13.4.8.1
                                                          ClassPortInfo on
                                                          page 589
Vendor defined  0x0011 - 0xFFFF  0x00000000 - 0xFFFFFFFF



Table 241 Vendor Attribute / Method Map

Attribute Name  VendorGet  VendorSet  VendorSend  VendorTrap
--------------  ---------  ---------  ----------  ----------
ClassPortInfo   x          x
Vendor defined  Vendor defined



16.5.3.1 ClassPortInfo

The ClassPortInfo attribute is described in 13.4.8.1 ClassPortInfo on page
589. Class-specific bits are defined by Vendor.

Table 242 Vendor ClassPortInfo:CapabilityMask

Bits   Name   Meaning
-----  -----  ------------------------------------------------------
0-7           Common bits as defined in 13.4.8.1 ClassPortInfo on
              page 589
8-15          Class-specific bits defined by Vendor

16.6 Application-specific

The Application-specific Agents are optional.

Application-specific operations can be defined using the
application-specific management classes.

Applications are free to define new methods and attributes and their use,
provided that they conform to the common MAD format and methods
defined herein, and do not conflict with the stated restrictions on
method and attribute utilization.

Standard applications are defined in Volume 3.

16.6.1 MAD Format

o16-14: The datagrams in these Application-specific classes shall
conform to the MAD format and use as specified in 13.4 Management
Datagrams on page 573 and further customized in Figure 178 Application
MAD Format on page 784 and Table 243 Application MAD Fields on page 784
below.

Figure 178 Application MAD Format
Table 243 Application MAD Fields

Field Name         Length     Description
-----------------  ---------  ------------------------------------------
Common MAD Header  24 bytes   Common MAD Header as described in 13.4.2
                              Management Datagram Format on page 574.
Data               232 bytes  Attribute data is mapped bit for bit from
                              the format described in the following
                              sections to the start of this data field.
                              If the attribute is smaller than the data
                              field, the content of the remainder of the
                              data field is unspecified.
16.6.1.1 Status Field

The Status field is described in 13.4.7 Status Field on page 587.
Class-specific bits are defined by the Application.

Table 244 Application Status Field

Bits   Name   Meaning
-----  -----  ------------------------------------------------------
0-7           Common bits as defined in 13.4.7 Status Field on
              page 587
8-15          Class-specific bits defined by Application



16.6.2 Methods

Application classes support the common methods.

Table 245 Application Class Methods

Method Type       Value  Description
----------------  -----  ---------------------------------------------
AppGet()          0x01   Request an attribute to return (read) from a
                         target.
AppSet()          0x02   Request a target to set (write) an attribute.
                         The object will issue an AppGetResp() as a
                         response.
AppGetResp()      0x81   Target responds to an attribute Get/Set
                         request.
AppSend()         0x03   Send a datagram. Does not require a response.
AppTrap()         0x05   Unsolicited datagram sent to the application
                         entity. Contains the Notice Attribute as
                         defined in 13.4.8.2 Notice on page 591 to
                         identify the trap.
AppTrapRepress()  0x07   Block repetition of notification.



Application-specific methods can be added as desired by application,
providing there is no collision with reserved methods in 13.4.5
Management Class Methods on page 577.

16.6.3 Attributes

o16-15: The Application classes shall support the ClassPortInfo attribute.

The Application classes may optionally support the attributes Notice and
InformInfo. All other attributes are application-defined.



Table 246 Application Class Attributes

Attribute Name       Attribute ID     Attribute Modifier       Description
-------------------  ---------------  -----------------------  ----------------
ClassPortInfo        0x0001           0x00000000               See 13.4.8.1
                                                               ClassPortInfo on
                                                               page 589
Application defined  0x0011 - 0xFFFF  0x00000000 - 0xFFFFFFFF



Table 247 Application Attribute / Method Map

Attribute Name       AppGet  AppSet  AppSend  AppTrap
-------------------  ------  ------  -------  -------
ClassPortInfo        x       x
Application defined  Application defined



16.6.3.1 ClassPortInfo

The ClassPortInfo attribute is described in 13.4.8.1 ClassPortInfo on page
589. Class-specific bits are defined by Application.

Table 248 Application ClassPortInfo:CapabilityMask

Bits   Name   Meaning
-----  -----  ------------------------------------------------------
0-7           Common bits as defined in 13.4.8.1 ClassPortInfo on
              page 589
8-15          Class-specific bits defined by Application
16.7 Communication Management

C16-14: The Communication Management Agent is mandatory on all
Channel Adapters.

Communication Management is described in Chapter 12. Proper use of
the messages defined in this section is subject to the protocols and state
machines defined in that chapter. No semantics are described in this
section.

16.7.1 MAD Format

C16-15: The datagrams in the Communication Management class shall
conform to the MAD format and use as specified in 13.4 Management
Datagrams on page 573 and further customized in Figure 179
Communication Management MAD Format on page 786 and Table 249
Communication Management MAD Fields on page 786 below.



Figure 179 Communication Management MAD Format

Table 249 Communication Management MAD Fields

Field Name         Length     Description
-----------------  ---------  ------------------------------------------
Common MAD Header  24 bytes   Common MAD Header as described in 13.4.2
                              Management Datagram Format on page 574.
Data               232 bytes  Attribute data is mapped bit for bit from
                              the format described in the following
                              sections to the start of this data field.
                              If the attribute is smaller than the data
                              field, the content of the remainder of the
                              data field is unspecified.
16.7.1.1 Status Field

The Status field is described in 13.4.7 Status Field on page 587. No
class-specific bits are defined.

Table 250 Communication Management Status Field

Bits   Name   Meaning
-----  -----  ------------------------------------------------------
0-7           Common bits as defined in 13.4.7 Status Field on
              page 587
8-15          Class-specific bits are reserved



16.7.2 Methods 



The Communication Management class supports the methods identified 
in Table 251 Communication Management Methods on page 787 below. 

Table 251 Communication Management Methods

Method Type      Value  Description
---------------  -----  ---------------------------------------
ComMgtGet()      0x01   Request a get (read) of an attribute.
ComMgtSet()      0x02   Request a set (write) of an attribute.
ComMgtGetResp()  0x81   Response from a get or set request.
ComMgtSend()     0x03   Send a connection management message.



16.7.3 Attributes 



The Attributes/Attribute Modifiers specified in this section describe the
mappings of message parameters defined in section 12.4 Communications
Services into the standard MAD header/payload format. The set of
attributes supported by the Connection Management class is listed in
Table 252 Communication Management Attributes on page 787.



Table 252 Communication Management Attributes

Attribute Name         Attribute  Attribute   Description
                       ID         Modifier
---------------------  ---------  ----------  ------------------------------
ClassPortInfo          0x0001     0x00000000  Refer to 13.4.8.1
                                              ClassPortInfo on page 589
ConnectRequest         0x0010     0x00000000  Refer to 12.6.5 REQ - Request
                                              for Communication on page 522
MsgRcptAck             0x0011     0x00000000  Refer to 12.6.6 MRA - Message
                                              Receipt Acknowledgment on
                                              page 524
ConnectReject          0x0012     0x00000000  Refer to 12.6.7 REJ - Reject
                                              on page 525
ConnectReply           0x0013     0x00000000  Refer to 12.6.8 REP - Reply to
                                              Request for Communication on
                                              page 528
ReadyToUse             0x0014     0x00000000  Refer to 12.6.9 RTU - Ready To
                                              Use on page 529
DisconnectRequest      0x0015     0x00000000  Refer to 12.6.10 DREQ -
                                              Request for communication
                                              Release (Disconnection
                                              REQuest) on page 530
DisconnectReply        0x0016     0x00000000  Refer to 12.6.11 DREP - Reply
                                              to Request for communication
                                              Release on page 530
ServiceIDResReq        0x0017     0x00000000  Refer to 12.6.5 REQ - Request
                                              for Communication on page 522
ServiceIDResReqResp    0x0018     0x00000000  Refer to 12.6.8 REP - Reply to
                                              Request for Communication on
                                              page 528
LoadAlternatePath      0x0019     0x00000000  Refer to 12.8.1 LAP - Load
                                              Alternate Path on page 540
AlternatePathResponse  0x001A     0x00000000  Refer to 12.8.2 APR -
                                              Alternate Path Response on
                                              page 541
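A receiver of Communication Management MADs needs to map the Attribute
ID in the common header to one of the messages in Table 252. A minimal
lookup sketch follows; the dictionary and function names are
hypothetical, and the identifiers are transcribed from the table as read
here.

```python
# Attribute ID -> attribute name, transcribed from Table 252.
CM_ATTRIBUTES = {
    0x0001: "ClassPortInfo",
    0x0010: "ConnectRequest",
    0x0011: "MsgRcptAck",
    0x0012: "ConnectReject",
    0x0013: "ConnectReply",
    0x0014: "ReadyToUse",
    0x0015: "DisconnectRequest",
    0x0016: "DisconnectReply",
    0x0017: "ServiceIDResReq",
    0x0018: "ServiceIDResReqResp",
    0x0019: "LoadAlternatePath",
    0x001A: "AlternatePathResponse",
}


def cm_attribute_name(attribute_id: int) -> str:
    """Return the attribute name, or "unknown" for unlisted IDs."""
    return CM_ATTRIBUTES.get(attribute_id, "unknown")
```

Every entry in the table uses Attribute Modifier 0x00000000, so the
lookup depends on the Attribute ID alone.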



Table 253 Communication Management Attribute / Method Map on page 
788 indicates the methods with which each of the attributes is valid. 

Table 253 Communication Management Attribute / Method Map

Attribute              ConMgtGet  ConMgtSet  ConMgtSend
---------------------  ---------  ---------  ----------
ClassPortInfo          X          x
ConnectRequest                               X
ConnectReply                                 X
ReadyToUse                                   X
MsgRcptAck                                   X
ConnectReject                                X
DisconnectRequest                            X
DisconnectReply                              X
LoadAlternatePath                            X
AlternatePathResponse                        X
ServiceIDResReq                              X
ServiceIDResReqResp                          X



The normative definitions of the attribute components and the operational 
requirements and constraints applicable thereto are defined in. 
16.7.3.1 ClassPortInfo

The ClassPortInfo attribute is described in 13.4.8.1 ClassPortInfo on page
589. In addition, bits 8 through 11 of the CapabilityMask component are
defined:

Table 254 Communication Management ClassPortInfo:CapabilityMask

Bits   Name                         Meaning
-----  ---------------------------  ----------------------------------
0-7                                 Common bits as defined in 13.4.8.1
                                    ClassPortInfo on page 589
8      IsMulticastCapable           Multicast is supported
9      IsReliableConnectionCapable  Reliable Connections are supported
10     IsReliableDatagramCapable    Reliable Datagrams are supported
11     IsRawDatagramCapable         Raw Datagrams are supported
12-15                               Reserved
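The class-specific CapabilityMask bits in Table 254 can be decoded with
a simple bit test. The sketch below assumes that bit 0 is the least
significant bit of the 16-bit mask; the text here does not state the bit
numbering convention, and the function name is hypothetical.

```python
# Class-specific bits of the Communication Management CapabilityMask.
CM_CAPABILITY_BITS = {
    8: "IsMulticastCapable",
    9: "IsReliableConnectionCapable",
    10: "IsReliableDatagramCapable",
    11: "IsRawDatagramCapable",
}


def decode_cm_capabilities(mask: int) -> set:
    """Return the names of the class-specific capabilities set in mask.

    Assumes bit 0 is the least significant bit of the mask.
    """
    return {name for bit, name in CM_CAPABILITY_BITS.items()
            if mask & (1 << bit)}
```

Bits 0 through 7 and 12 through 15 are ignored by this helper because
they are common or reserved.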
Chapter 17: Channel Adapters



17.1 Overview



This section specifies the minimum requirements for an IBA channel 
adapter. Channel adapters (CA) are the source and terminus of IBA 
packets that traverse the IBA switching fabric. Channel adapters are ei- 
ther Host Channel Adapters (HCAs) or Target Channel Adapters (TCAs). 
In a typical system, the HCAs are used by the host processors to connect 
to the IBA fabric whereas the TCAs are used by an I/O adapter to connect 
to the IBA fabric. 

The key difference between a Host Channel Adapter and a Target 
Channel Adapter is in the way the client (whether the client is hardware or 
software) interfaces to the transport layer. Specifically, the HCA supports 
the IBA Verbs layer whereas the TCA uses an implementation dependent 
interface to the transport layer. 

017-1: An HCA shall support the IBA verbs layer interface. 

Previous sections of the specification have described the various layers 
comprising an IBA Channel Adapter (physical, link, network, and transport 
layers). All channel adapters share a common architecture for the phys- 
ical, link, network and transport layers. See Figure 180 below. From the 
point of view of the physical communications link, an HCA and TCA are 
identical. 



[Figure: protocol stacks of a Host Channel Adapter (HCA: Verbs interface
to upper levels, Transport Layer, Network Layer, Link Layer, Physical),
an Intermediate Fabric Element, e.g. a Switch or Router (Network Layer,
Link Layer, Physical; interface to management port is not shown), and a
Target Channel Adapter (TCA: Implementation Specific ULP, Transport
Layer, Network Layer, Link Layer, Physical). ULP: Upper Layer Protocol.]

Figure 180 IBA Architecture Layers









This chapter lists the common functionality in all CAs as well as the differ- 
ences between HCAs and TCAs. There are also differences in required 
minimum functionality. These issues are addressed in the following sec- 
tions. 



17.2 Common Functional Requirements

17.2.1 Multiple Ports per Channel Adapter

A Channel Adapter¹ may have one or more ports. A CA's port provides the
physical, link and network protocol layers of the IBA CA. A channel
adapter with multiple ports shares the transport layer functionality
amongst the ports. For example, a QP (a transport layer construct) can be
configured to work with any of the ports on the CA. The following figures
show the physical representation as well as the architectural layering.



[Figure: a Host Processor or Target Adapter running Process A and
Process B, with QP20, QP21 and QP22 in a two ported Channel Adapter. The
CA has one block of QPs, memory translation tables etc. that interact
with the IBA fabric through the two ports.]

Figure 181 Multiport CA

[Figure: a two ported CA showing a common transport layer (Transport
Layer Protocols) and independent Network, Link and Physical Layers for
each port.]

Figure 182 Multiport CA Architectural Layers



It is desirable for a channel adapter to have multiple ports for several
reasons:

• Increased bandwidth from a single CA. For example, an HCA with a
  high performance host memory interface can support the bandwidth
  of several IBA links. By adding relatively low cost IBA ports
  the HCA can multiply its throughput with relatively little additional
  cost. See Figure 184 on page 793.

• Redundant paths for fault tolerant communications. In a system
  with multiple paths between source and destination, a CA's
  multiple ports may be used to tolerate faults in the fabric's
  switches and links. See Figure 185 on page 793 and Figure 186 on
  page 794.

• Support direct links to TCAs. In a low cost, switchless topology
  the ports of an HCA might be directly wired to TCAs. See Figure
  187 on page 794.

1. Unless specifically mentioned, the term Channel Adapter refers to both
an HCA and a TCA.

17.2.1.1 Topologies Supported With Multi-Ported Channel Adapters 

The following diagrams show several basic ways a multi-ported CA could
be attached to the rest of the system. These diagrams are by no means
the only topologies supported; they are examples only. Note that multiple
ports on a CA may either connect to multiple subnets or to the same
subnet.



[Figure: multiported CAs attached through their ports to a single
subnet. Each box representing a CA is expected to have a single block of
QPs along with the associated memory translation tables, QP state etc.]

Figure 183 Multiported CAs Connected to Single Subnet



C17-2: A multiported CA shall be capable of connecting to one or more
subnets.
[Figure: CAs with one or two ports attached to different subnets. In
this topology not all ports can reach all destination CAs. In this
diagram there are three separate subnets.]

Figure 184 Multiported CAs Connected to Multiple Subnets





[Figure: two cases are shown, symmetric fabrics and non-identical
fabrics. In both these topologies any single failure in the switching
fabric allows communication among all CAs to continue.]

Figure 185 Fault Tolerant Connections to Independent Fabrics
[Figure: in this topology any single failure in the switching fabric
leaves communication among all CAs intact.]

Figure 186 Fault Tolerant Connection to a Single Fabric
[Figure: a multiported HCA whose ports are wired directly to the ports
of TCAs; remaining TCA ports lead to other TCAs or Switches.]

Figure 187 Multiported HCA with Direct Connections to TCAs

17.2.1.2 Association of QPs with Ports 

C17-3: While a CA may have many QPs and many ports, each QP shall
generate request packets, service returning response packets, and
respond to arriving request packets through exactly one port, at any
point in time.

To ensure packet ordering within a QP for connected or reliable transport 
services, packets are required to take the same path between a source 
and destination. This requires that all requests and responses use a con- 
sistent port, base SLID, DLID, VL and SL. A connected or reliable trans- 
port QP remains bound with one port until path migration for error 
recovery or load balancing purposes occurs or the connection goes away. 
Aside from valid packets requesting path migration as described in
Section 17.2.8 Automatic Path Migration on page 804, incoming request and
response packets arriving at a port other than the port currently bound to
the appropriate QP may be discarded.

For an HCA using the unreliable datagram transport service, the verbs 
layer specifies the remote address with each outgoing work request. 
Since the QP is only bound to one port, the client of the verbs layer must 
be certain the destination is reachable from that port. In certain topologies 
not all destinations are reachable from all ports (see Figure 184 on page 
793 ). 

Incoming Unreliable Datagram packets may only target a QP if that QP is 
bound to the port on which the datagram arrived. 
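The port binding rules above can be summarized as a single acceptance
check. The sketch below is a simplification under stated assumptions:
the `QueuePair` type and `should_deliver` function are hypothetical, the
text says mismatched packets "may be discarded" while this sketch always
discards them, and the validity checks on a path migration request are
left to the caller.

```python
from dataclasses import dataclass


@dataclass
class QueuePair:
    number: int
    bound_port: int  # the one port this QP currently uses


def should_deliver(qp: QueuePair, arrival_port: int,
                   requests_path_migration: bool = False) -> bool:
    """Decide whether an incoming packet is delivered to the QP."""
    if arrival_port == qp.bound_port:
        return True
    # A valid path migration request is the stated exception for
    # packets arriving on a port other than the bound one.
    return requests_path_migration
```

For a Reliable Datagram QP the same check would apply per EE context,
since each EE context is bound to one port.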

The Reliable Datagram service uses an end-to-end context to ensure cor- 
rect delivery for every channel adapter with which it communicates. The 
EE context, like a Reliable Connection QP, is bound to one port (at least 
until path migration is used to rebind the EE context with a new port). But 
since a RD QP can communicate with multiple EE contexts, the RD QP 
can in effect be transmitting and receiving packets from multiple ports. 



17.2.1.3 Port Attributes and Functions



Certain attributes and functions are associated with each port. Typically 
these belong to the physical, link, and network layers that are unique to 
the port. The table below itemizes these as well as describes some trans- 
port layer functionality unique to each port. Each attribute or function is in- 
tended to be applied individually to each port. 

Table 255 Port Attributes & Functions

Attribute/Function: Physical Interface
  HCA and TCA: The HCA and TCA shall support one or more of the IBA
  defined physical interfaces. (See the IBA Specification, Volume 2)

Attribute/Function: Static Rate Control (limiting the BW to a particular
destination CA)
  HCA and TCA: required on ports supporting bandwidths above 2.5 Gbps

Attribute/Function: Support for multipathing (see Section 7.11 Subnet
Multipathing on page 185)
  HCA: required
  TCA: required

Attribute/Function: P_Key Checking on inbound Request and Inbound
Response Packets (see Section 10.9 Partitioning on page 427)
  HCA and TCA: Shall validate incoming packet's P_Key with the P_Key
  bound to the destination QP [a].
  A CA shall maintain a P_Key table (see Section 10.9.2 The Partition Key
  Table (P_Key Table) on page 429) per port. Each table shall have at
  least two P_Key entries.
  HCA requires no OS involvement to set the P_Key (i.e. P_Key is set
  directly by a Subnet Manager control packet.)

Attribute/Function: Validation of incoming packet's DLID and, if the GRH
is present, DGID
  HCA: required
  TCA: required

Attribute/Function: Support for QP0 and QP1
  HCA: required on each port
  TCA: required on each port

Attribute/Function: Port Numbering
  HCA and TCA: Ports are numbered starting from one and if there are
  multiple ports, they are numbered sequentially. MADs use port number
  zero as a wild-card port number that matches whatever port the packet
  arrived at. See 14.2.5.6 PortInfo on page 633.

Attribute/Function: GID Support
  HCA and TCA: Each port has at least one GID. The maximum number of
  GIDs per port is implementation specific. See the discussion on GIDs in
  Section 4.1 Terminology And Concepts on page 109.

a. excluding Raw Datagram QPs (because raw datagrams don't have a P_Key). Also P_Key checking for
QP0 and QP1 are slightly different. See Section 10.9.8 Partition Enforcement on Management
Queue Pairs on page 433 for more information.

C17-4: Static rate controls, as listed in Section 17.2.6 Static Rate Control
on page 803, are required on each port that supports a data rate above
2.5 Gbps.

C17-5: Each port shall validate the incoming P_Key in an IB Transport
packet with the P_Key bound to the destination QP (other than QP0 and
QP1).

C17-6: The CA shall maintain a P_Key table per port supporting at least
two P_Key entries.

C17-7: An HCA shall require no OS involvement to set the P_Key table; 
the P_Key table shall be set directly by Subnet Manager MADs. 

C17-8: Each port shall support at least one GID. 

17.2.1.4 Switching Packets through Multiple Ports

If a Channel Adapter has multiple ports, the CA does not route packets 
from one port to the other. Such a packet forwarding function is defined as 
a switch. 

An implementation may choose to package a switch and multiple IBA 
ports together, as shown in the figure below. 



The overall box shows the boundary of the combined switch and Channel
Adapter. This boundary could be a single IC or board. The boundary also
represents the fault zone for that device.

Figure 188 Multiple Single Ported CAs with an Embedded Switch



17.2.2 Channel Adapter Attributes 



The previous section described attributes of a channel adapter's ports. 
This section describes attributes of the whole channel adapter. 

This specification only sets the minimum functionality of an HCA or TCA. 
For example, only two QPs are required (both for management). A prac- 
tical HCA or TCA would undoubtedly support more QPs, but this section 
only specifies architectural minimum requirements. The following summarizes
various required and optional Channel Adapter attributes (see also
section 11.2.1.2 Query HCA on page 449).

Table 256 Channel Adapter Attributes

Attribute: Support for multiple ports.
  HCA: Optional
  TCA: Optional

Attribute: Source/Sink packets with a LRH (for communication within the
subnet)
  HCA and TCA: Required for all QPs.

Attribute: Source/Sink packets with a GRH (for communication across
subnets)
  HCA and TCA: Required for all QPs other than QP0.

Attribute: Transport Services Supported
  HCA: HCAs shall be capable of supporting the Unreliable Datagram,
  Reliable Connection, and Unreliable Connection transport service on
  any QP supported by the HCA.
  TCA: Aside from supporting Unreliable Datagram for the two required
  management QPs, support for any other transport service (or QP) is
  optional.

Attribute: Atomic Operations Supported
  HCA and TCA: Optional to generate requests or responses

Attribute: Other Operations Supported
  HCA: If a transport service is supported, then the CA must support all
  the operations defined for that transport service (excluding atomic
  operations).
  TCA: TCAs are not general purpose and may customize the operations
  supported to suit their function (e.g. a TCA with Reliable Connection
  Service may generate RDMA requests but not respond to incoming RDMA
  requests)

Attribute: Solicited Events (see Section 9.2.3 Solicited Event (SE) - 1 bit
on page 202 and Section 11.4.2.2 Request Completion Notification on
page 506)
  HCA: Required to both generate solicited events and to receive them.
  TCA: Optional

Attribute: Path MTU
  HCA and TCA: CAs shall support one of the following sets of MTUs (for
  all Transport Service Classes):
    • 256 Bytes
    • 256, 512 Bytes
    • 256, 512, 1024 Bytes
    • 256, 512, 1024, 2048 Bytes
    • 256, 512, 1024, 2048, 4096 Bytes
  HCA: For UD and Raw the WQE determines the MTU.
  For RD the EE context specifies the MTU.
  For RC and UC the QP specifies the MTU.
  TCA: Selection of MTU for TCAs is implementation specific.

Attribute: End-to-End Flow Control (reliable connection transport service
only)
  HCA: HCA receive queues shall generate E-to-E flow control credits
    • i.e. HCAs throttle inbound requests to prevent inbound Sends
      arriving at an empty receive queue.
  HCA send queue shall receive and respond to inbound credits
    • i.e. remote node may throttle the HCA's outbound requests.
  TCA: TCA receive queues may generate E-to-E flow control credits.
    • i.e. TCA need not throttle inbound requests.
  TCA send queue shall receive and respond to inbound credits.
    • i.e. remote node may throttle the TCA's outbound requests.

Attribute: Multicast
  HCA and TCA: Generating IBA Raw Multicast packets is optional.
  Receiving IBA Raw Multicast packets is optional.
  Generating IBA Unreliable Datagram Multicast packets is optional [a].
  Receiving IBA Unreliable Datagram Multicast packets is optional.

Attribute: Automatic Path Migration
  HCA and TCA: It is optional to either generate or respond to an
  automatic path migration request.

Attribute: Memory Protection
  HCA: HCAs provide memory protection as described in Section 10.6
  Memory Management on page 399.
  TCA: Optional

Attribute: Loopback Support
  HCA: Self addressed packets [b] shall be allowed and shall not go out
  onto the wire. That is, self addressed packets must work even if no
  external switch is present
  TCA: Optional

a. It is expected that any implementation of the Unreliable Datagram transport service will trivially support the
generation of multicast packets.

b. A self-addressed packet is a packet whose DLID and SLID (while not necessarily identical) address the same
CA. A self-addressed packet may or may not have the same source and destination QP. IB does not define
a specific "loopback" address.

C17-9: All channel adapters shall be able to source and sink (to all QPs)
locally routed packets (i.e. no GRH).

C17-10: All channel adapters shall be able to source and sink (to all QPs
other than QP0) globally routed packets (i.e. packets with a GRH).

C17-11: HCAs shall be capable of supporting the Unreliable Datagram,
Reliable Connection, and Unreliable Connection transport service on any
QP supported by the HCA.

C17-12: If a transport service is supported by an HCA, then that HCA
must support all the operations defined for that transport service
(excluding atomic operations).

C17-13: An HCA shall be able to generate and receive solicited events.

C17-14: CAs shall support one of the following sets of MTUs (for all
Transport Service Classes):
256 Bytes
256, 512 Bytes
256, 512, 1024 Bytes
256, 512, 1024, 2048 Bytes
256, 512, 1024, 2048, 4096 Bytes
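
The five permitted MTU sets listed above are exactly the non-empty prefixes of the ascending sequence 256, 512, 1024, 2048, 4096. The following is a minimal sketch of a check built on that observation; the names MTU_LADDER and is_valid_mtu_set are hypothetical and are not defined by IBA.

```python
# Ascending MTU values (in bytes) taken from the list of permitted sets.
MTU_LADDER = (256, 512, 1024, 2048, 4096)


def is_valid_mtu_set(mtus):
    """Return True if `mtus` is one of the five permitted sets.

    Each permitted set is a non-empty prefix of MTU_LADDER, so a CA that
    supports a given MTU must also support every smaller one.
    """
    ordered = tuple(sorted(set(mtus)))
    return len(ordered) > 0 and ordered == MTU_LADDER[:len(ordered)]


if __name__ == "__main__":
    print(is_valid_mtu_set([256, 512, 1024]))  # a permitted set
    print(is_valid_mtu_set([512, 1024]))       # skips 256, not permitted
```

The prefix test rejects sets with gaps (for example 256 and 1024 without 512) as well as sets that omit 256.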

C17-15: HCA receive queues shall generate E-to-E flow control credits.

C17-16: HCA send queue shall receive and respond to inbound E-to-E
flow control credits.

o17-1: TCA receive queues may generate E-to-E flow control credits.

C17-17: TCA send queue shall receive and respond to inbound E-to-E
flow control credits.
o17-2: A CA may be capable of generating multicast packets. 

o17-3: A CA may be capable of receiving multicast packets.

o17-4: A CA may be capable of generating and responding to the Automatic
Path Migration protocol.

C17-18: HCAs shall allow packets with a destination address the same as
that of the port on which the packet is issued. Such a loopback packet
shall not go onto the wire.

17.2.3 Deadlock Prevention

Each CA shall not cause deadlock in the fabric. This condition is met by:

• The CA will not continuously and permanently assert backpressure
(i.e. fail to grant link credits).

• The CA shall not assert backpressure on a port's inbound link as
the result of receiving backpressure on that port's outbound link.

C17-19: For deadlock prevention, the CA shall not continuously and
permanently assert backpressure (i.e. fail to grant link credits).

C17-20: For deadlock prevention, the CA shall not assert backpressure
on a port's inbound link as the result of receiving backpressure on that
port's outbound link.

17.2.4 Checking Incoming Packets

All CAs are required to validate each incoming packet before committing
the packet to the CA's state.

C17-21: The CA shall check for link, network and transport layer errors in
all incoming packets.

17.2.5 Non-Volatile STATE

C17-22: All channel adapters shall maintain an EUI-64 port GUID and an
EUI-64 CA GUID (See Chapter 4: Addressing on page 108) in nonvolatile
memory such that the GUID is the same each time the CA is powered on.

C17-23: Any CA that can become a Subnet Manager (see Section 14.2
Subnet Management Class on page 611) shall also keep its Subnet
Manager Priority in nonvolatile memory.

Other uses of the nonvolatile memory are optional.

The type of non volatile memory in a CA is not specified and might be a
local disk drive or on-chip memory.

IBA does not require a CA to remember connection state information
across power cycles.

17.2.6 Static Rate Control

A CA shall support static rate control (see section 9.11 Static Rate Control
on page 365) if its raw bandwidth is greater than 2.5 Gbps. The Inter-packet
Delay (IPD) values supported (see Table 63 on page 367) must
allow slowing the packet rate to all of the standard link rates. The table
below indicates the values of IPD that shall be supported.

Table 257 Static Rate Control IPD Values

IPD   Multiplier   Rate   Comment
0     0            100%   Required by all CAs
3     3.00         25%    Required by CAs that support 1 GB/s or higher link rate
2     2.00         33%    Required by CAs that support 3 GB/s or higher link rate
11    11.00        8%     Required by CAs that support 3 GB/s or higher link rate
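
The rate column of Table 257 is consistent with an effective rate of 1/(IPD + 1) of the full link rate. That relationship is inferred here from the table values and is an assumption; section 9.11 remains the authoritative definition. A minimal sketch, with the hypothetical helper name relative_rate:

```python
def relative_rate(ipd: int) -> float:
    """Fraction of the full link rate left after applying an IPD value.

    Assumes rate = 1 / (IPD + 1), which matches the four rows of
    Table 257; this formula is inferred, not quoted from the text.
    """
    if ipd < 0:
        raise ValueError("IPD must be non-negative")
    return 1.0 / (ipd + 1)


if __name__ == "__main__":
    # IPD values taken from Table 257.
    for ipd in (0, 3, 2, 11):
        print(f"IPD {ipd:2d}: {relative_rate(ipd):.0%} of full rate")
```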



17.2.7 Management Messages 



Each port of every channel adapter shall support two QPs for manage- 
ment commands: 

• QP0, used by the Subnet Management Agent for sending and receiving
Subnet Management Packets (SMPs).

• This QP uses the Unreliable Datagram transport service.

• SMP packets arriving before the current packet's command
completes may be dropped (i.e. the minimum queue depth of
QP0 is one).

• QP1, used for the General Services Interface (GSI).

• This QP uses the Unreliable Datagram transport service.

• All traffic to and from this QP uses any VL other than VL15.

• GSI packets arriving before the current packet's command
completes may be dropped (i.e. the minimum queue depth of
QP1 is one).

C17-24: Each port of every CA shall support QP0 for use by the SMA and
QP1 for use by the GSI.

All QPs for a given CA, except QP0 and QP1, have unique numbers. QP0
and QP1 are special in that each port has its own QP0 and QP1.

The rest of the (non RD) QPs on a CA may be bound with any one port.
The binding of a QP (other than QP0 or QP1) with a port is maintained
until such time that automatic path migration (see 17.2.8 Automatic Path
Migration on page 804) or path migration requested by MADs associates
the QP with a different port.

The management QPs are special because they are used by the Subnet
Manager and other management applications. See Section 13.5.1 MAD
Interfaces on page 600.

Since each port may be on a different subnet it must communicate with a
different Subnet Manager and related management application. The
Subnet Manager and other nodes using the GSI use the well known QP
numbers (0 and 1) to establish communication.

17.2.7.1 Subnet Management

All CAs shall respond to incoming Subnet Management Packets from the
Subnet Manager. CAs shall also generate the required traps defined as
part of the SMA.

The IBA does not require nor preclude a CA from being a Subnet Manager.
If a node does host a Subnet Manager, it must meet the requirements
as specified in section 14.4 Subnet Manager on page 655.

17.2.7.2 General Services

All CAs shall respond to mandatory GSI MADs defined in Chapter 16:
General Services on page 717. Any HCA or TCA may initiate MADs to
another CA.



17.2.8 Automatic Path Migration 



The reliable or connected transport services (Reliable Connection, Reli- 
able Datagram, and Unreliable Connection) use the same path for a given 
connection (or in the case of RD for a given pair of end-to-end contexts). 
This ensures data is delivered in the proper order. Path migration refers to 
the requestor and responder agreeing to use a new path. The source and 
destination QPs remain the same but the ports and path through the fabric 
may change. Path migration may be used to recover from a bad path
(sometimes this is referred to as Failover) or for other reasons such as
load balancing.

Automatic path migration may be supported by HCAs and TCAs. If supported,
Automatic path migration works for QPs using the RC, RD, and
UC transport services.

Automatic path migration provides a fast mechanism for path migration.
When a connection is established the two CAs use Communication Management
MADs to establish the primary and alternate path (See sections
10.4 Automatic Path Migration on page 393 and 12.8 Alternate Path
Management on page 539). Automatic path migration is a mechanism whereby
either CA can signal the other to switch from the primary path to the
alternate path.

At connection establishment time the CA is set with the following
information to determine a path:

• DLID of the responding CA

• Destination GID of the responding CA

• SL

• source port (i.e. the base SLID and path bits for outbound request
and response packets)

At connection establishment the CAs may be given two sets of path
information, one for the primary path and another for the alternate path. The
alternate path may use the same or different source and destination ports
as that used by the primary path.

17.2.8.1 Automatic Path Migration Protocol

The automatic path migration protocol uses a single bit in the BTH called
MigReq (Migrate Request) and a 3-state state machine associated with
each connected QP supporting automatic path migration. If automatic
path migration is not supported by either QP of the connection, the state
machines of the two QPs remain in the MIGRATED state. See figure
below.

[State diagram. States and the MigReq value each places in outbound packets:
  MIGRATED (Initial State): Outbound Packet: MigReq = TRUE
  REARM: Outbound Packet: MigReq = FALSE
  ARMED: Outbound Packet: MigReq = FALSE
Transition labels:
  To REARM: A MAD has loaded (by receiving a REQ or REP MAD) or reloaded
  (by receiving a LAP or APR MAD) alternate path information and enabled
  transition to ReArm.
  To ARMED: Inbound Packet: MigReq = False.
  To MIGRATED: Local node has decided to request a path migration. In an
  HCA, Path Migration may be requested by the verbs client (using the
  ModifyQP/EE verb) or by the verbs layer. In a TCA, requesting Path
  Migration is implementation specific. OR Inbound Packet MigReq = TRUE.
  Other labels in the figure: "Otherwise" (twice) and "Inbound Packet:
  MigReq = True".
Note in figure: Upon entry to the MIGRATED state, the variables used by
the QP logic for setting the outbound path and validating the inbound path
are loaded with the alternate path state.]

Figure 189 Automatic Path Migration State Machine (per QP)



17.2.8.1.1 Initialization

At connection setup time, the primary and alternate path states are
assigned to each CA.

The Automatic Path Migration State Machine is initialized to the
MIGRATED state whether or not Automatic Path Migration is supported by
either QP of the connection.

17.2.8.1.2 Migration Request

Either CA may request an automatic path migration. Reasons for
requesting automatic path migration are outside the scope of the IBA
specification but may include load balancing or using an alternate path to
recover from excessive errors.

The CA requesting automatic path migration transitions its state machine
to the MIGRATED state. Once in the MIGRATED state the CA generates
all new packets (both request and response packets) using the path that
was previously initialized as the alternate path. The CA may refuse to
accept incoming request or response packets arriving from the original path.

Once in the MIGRATED state, all outbound packets (both requests and
responses) on that QP have the MigReq (Migrate Request) bit in the BTH
set to TRUE.

17.2.8.1.3 Migration Response

A CA whose QP is in the ARMED state that receives a packet (either
request or response packet) with the MigReq bit set validates the incoming
packet's path bits with the alternate path information (i.e. checks the SLID,
DLID, SGID, DGID against the alternate path state). If the validation
passes the QP transitions to the MIGRATED state.

Upon entry to the MIGRATED state, the primary path variables used by the
QP logic for setting the outbound path and validating the inbound path are
loaded with the alternate path state. At this point all request and response
packets from both CAs are using the alternate path.

17.2.8.1.4 Re-enabling Migration

Migration is re-enabled via management intervention. First the alternate
path variables are reloaded with new alternate paths. Then, based on a
command from a management entity, the QP state is set to REARM. This
causes the MigReq bit in outbound packets to be set false. Upon receiving
an inbound packet with MigReq set false, the QP state is set to ARMED.
Migration at this point is now re-enabled.
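
The request, response and re-enabling steps above can be sketched as a small per-QP state machine. This is an illustrative model only: the class ApmQp, its method names, and the use of an opaque value to stand for a path (SLID, DLID, SGID, DGID) are all hypothetical, and the sketch folds "reload the alternate path" and "set the state to REARM" into one management call for brevity.

```python
from enum import Enum


class ApmState(Enum):
    MIGRATED = "MIGRATED"
    REARM = "REARM"
    ARMED = "ARMED"


class ApmQp:
    """Hypothetical per-QP model of the three-state machine."""

    def __init__(self, primary, alternate=None):
        self.state = ApmState.MIGRATED  # initial state (17.2.8.1.1)
        self.primary = primary
        self.alternate = alternate

    def outbound_migreq(self):
        # MigReq is TRUE in outbound packets only while MIGRATED.
        return self.state is ApmState.MIGRATED

    def rearm(self, new_alternate):
        # Management reloads the alternate path, then moves the QP to REARM.
        if self.state is ApmState.MIGRATED:
            self.alternate = new_alternate
            self.state = ApmState.REARM

    def request_migration(self):
        # Local decision to migrate; modeled as valid only once ARMED.
        if self.state is ApmState.ARMED:
            self._enter_migrated()

    def on_inbound(self, migreq, path):
        if self.state is ApmState.REARM and not migreq:
            self.state = ApmState.ARMED
        elif (self.state is ApmState.ARMED and migreq
              and path == self.alternate):
            # Inbound path validated against the alternate path state.
            self._enter_migrated()

    def _enter_migrated(self):
        # On entry to MIGRATED the alternate path becomes the path in use.
        self.primary = self.alternate
        self.state = ApmState.MIGRATED


if __name__ == "__main__":
    a, b = ApmQp("path-1"), ApmQp("path-1")
    a.rearm("path-2")
    b.rearm("path-2")
    a.on_inbound(b.outbound_migreq(), "path-1")  # MigReq false: a arms
    b.on_inbound(a.outbound_migreq(), "path-1")  # MigReq false: b arms
    a.request_migration()                        # a migrates first
    b.on_inbound(a.outbound_migreq(), "path-2")  # MigReq true: b follows
    print(a.state.value, b.state.value, b.primary)
```

In this model a QP that never receives a rearm call stays in MIGRATED, which mirrors the statement that QPs without automatic path migration support remain in the MIGRATED state.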



17.3 Host Channel Adapter 



An HCA is differentiated from a TCA in that it supports the architecturally
defined IBA Verbs Layer. As such, an HCA (and its vendor specific and
OS specific driver SW) shall support the functionality of the Verbs Layer
chapter.

17.3.1 LOOPBACK



An HCA shall be able to internally loopback a packet sent to itself. That is, 
the verbs layer can specify a packet to be delivered to the same port (pos- 
sibly a different QP though). The packet shall be delivered without the 
packet appearing on the port's physical link. This loopback shall be able 
to function without requiring the presence of an external switch. Further- 
more there is no special loopback address required. 

On an HCA with multiple ports, a packet may be sent onto the wire from
one port with the DLID in the packet targeting a different port. This is not
considered loopback and follows all the normal rules for sending packets.
An external switch is required for such a packet transfer; there is no
requirement that a packet be routed internally from one port to another.

Loopback packets for diagnostic purposes that traverse an external 
switch are performed by using the directed routed subnet management 
packets. 

17.4 Target Channel Adapter



A channel adapter that attaches an I/O node to the fabric is a Target 
Channel Adapter (TCA). In most regards, a TCA is indistinguishable from 
an HCA when viewed from the perspective of the IBA wire semantics. 
However, there are certain characteristics and requirements that distin- 
guish a TCA from an HCA. This section describes some of the differences 
between a target channel adapter and a host channel adapter, specifies 
functionality required of a TCA, and specifies minimum requirements on a 
TCA. 

This section also describes the role of the target channel adapter in sup- 
porting its clients. Figure 190 illustrates the relationship of the target 
channel adapter in an I/O node. The client of the target channel adapter's 
services is one or more I/O controllers. 

[Figure: an I/O Node containing a TCA and I/O Controllers 0 through n,
attached to I/O Ports or Devices]

Figure 190 Generic I/O Node Model

17.4.1 Contrast to a Host Channel Adapter

Unlike a host node, the execution environment for an I/O node is not nec- 
essarily associated with a general purpose processor. In fact, it can be en- 
tirely in hardware without any software environment. 

For a host channel adapter, IBA specifies the semantics of the client inter- 
face characteristics (i.e., verbs) in order to support run time binding with 
the host's operating system and allow each component (HCA, OS, appli- 
cation) to be architected and distributed independently. But a target 
channel adapter can be bound to the I/O controller as part of the design
process and distributed together. Thus the architecture does not specify 
any particular relationship between the target channel adapter and the I/O 
controller. This freedom promotes diversity and the ability to employ any 
queuing and notification mechanism that best serves the I/O function. 



In the host environment, as illustrated, IBA service is separated into layers with the 
HCA hardware and the HCA driver being referred to as the host channel adapter. 
Thus the IBA services are not included in the requirements for the HCA. Instead, re- 
quirements for IBA services are applied to the host platform in general and not the 
HCA vendor. 



[Figure: layered host environment showing IBA Services (Subnet, Device,
Message & Connection) as S/W above the HCA Driver and the HCA H/W]

Figure 191 Host Environment - Split Responsibility

In the target environment, as illustrated, IBA services are not separated from TCA
channel functionality, and thus the target channel adapter includes the IBA services.
Thus the term target channel adapter is abstracted to mean all of the IBA
mechanisms in the I/O node (target). TCAs may be implemented using software and
hardware or hardware alone.

Figure 192 Target Environment - TCA Responsibility

A host channel adapter provides a generic service to its application
"clients". Therefore, IBA requires that an HCA provide full channel
functionality. This is because the HCA vendor does not have prior knowledge of
what applications will run over its channels.

However, since a target channel adapter vendor may have prior knowledge
of the way the target channel adapter will be applied, the hardware
vendor can reasonably restrict a TCA's capabilities to only what is
necessary for its clients.

17.4.1.1 Memory Protection

IBA does not require that a TCA make any of its memory, or the memory
associated with its attached I/O controllers, directly accessible to
another channel adapter. That is, there is no requirement that a TCA be
capable of accepting inbound Atomic or RDMA READ or WRITE requests.
If the TCA does expose its memory, or that of its attached I/O controllers, 
the architecture does not require that the TCA provide any form of 
memory protection, nor prescribe any particular mechanism for regis- 
tering or protecting access to that memory other than the mechanism pro- 
vided in the transport layer. 



17.4.2 Device Administration

Device administration packets allow an I/O node's resources to be
discovered and managed. In particular they provide the ability for a host to
discover and invoke I/O services provided by I/O nodes.

A key distinction between a host and an I/O node lies in the method by
which the I/O node's resources and capabilities are discovered, and the
method by which connections to the TCA, and hence to the I/O resources
on the node, are established. A complete set of messages is defined in
Section 16.3 Device Management on page 761 for the purpose of allowing
a consumer of the I/O node's services to discover the range of services
offered by the I/O node.

Since a TCA is not required to support all IBA transport services, a
particular TCA has associated with it a set of attributes defining its capabilities
and the services it supports. Normally, these attributes are discovered by
negotiation between peer channel adapters during the process of
establishing a connection.

In the case of an I/O node which does not necessarily have the compute
power and resources to participate in a complex negotiation, IBA defines
a simple method by which a host or other intelligent I/O node can discover
the target's attributes and establish connections accordingly.

Thus, IBA defines a rich set of I/O node attributes that can be read by an
intelligent channel adapter and used during connection establishment in
order to free the target from complex connection negotiation protocols.
The host discovers target attributes directly, thus avoiding negotiation
during connection establishment.

Each target must support the set of target/IO device attribute discovery
messages as defined in Section 16.3.3 Attributes on page 765.

IBA does not specify the semantics nor methods between a TCA and its
clients for conveying device information but it does specify the
mechanisms and encoding for conveying that information.

IBA does not specify the semantics nor queueing models for the I/O
controller to post messages, receive messages, and invoke RDMA Read,
Write, and Atomic operations between a TCA and its clients.

IBA is I/O protocol agnostic. That is, how an I/O controller chooses to
apply the services provided by the TCA is outside the scope of IBA as long
as the usage corresponds to the rules for class of service and quality of
service.

17.4.3 Fabric Loopback

A TCA does not have the same internal loopback requirement as does the
HCA. Being a special purpose device, how the TCA handles packets
addressed to itself is an implementation specific decision.

Loopback for diagnostic purposes that traverses an external switch is
performed by using directed routed subnet management packets (just as
done by the HCA).

Chapter 18: Switches

This chapter specifies the requirements related to IBA switches.

18.1 Overview

Packets may be forwarded within a subnet (intra-subnet) and between
subnets (inter-subnet). IBA switches are the fundamental forwarding
component for intra-subnet routing (inter-subnet routing is provided by IBA
routers, described later in this specification). Switches interconnect links
by forwarding packets between the links.

Switches are transparent to the end stations and are not directly
addressed (except for subnet management operations). To this end, every
destination port within the network is configured with one or more unique
Local Identifiers (LIDs). From the point of view of a switch, a LID
represents a path from the input port through the switch. Switch elements are
configured with forwarding tables. Packets are addressed to their ultimate
destination on the subnet using a destination LID (DLID), not to
intervening switches. Individual packets are forwarded within a switch to an
outbound port or ports based on the packet's DLID field and the Switch's
forwarding table.

IBA switches are required to support unicast forwarding and may support
multicast forwarding. In addition, IBA switches support a form of source
routing, referred to as Directed Routing, for forwarding subnet
management packets. This enables the configuration of a subnet without valid
forwarding entries in the switches (e.g. a subnet power-up).

A Subnet Manager (SM) configures switches including loading their
forwarding tables. The entity that communicates with the SM for the purpose
of configuring the switch is referred to as the Subnet Management Agent
(SMA). Every switch is required to have a subnet management agent.
Individual switches within a power domain can be made observable to the
SM via multiple instantiation of SMAs. Likewise, an SMA can be
constructed that configures multiple switches and exports the multiple
switches to the SM as a single switch; however, from the SM's
perspective, such a configuration is a single switch.

Switches must also support a Subnet Management Interface (SMI) as
specified in Chapter 14: Subnet Management on page 610 and a General
Services Interface (GSI) as specified by Chapter 16: General Services on
page 717. There are various mandatory and optional requirements of
these interfaces that are specified in the respective chapters.

18.2 Detailed Functional Requirements

18.2.1 Attributes



This section describes the major architecturally defined attributes of 
switches that are left as implementation choices. 

Unicast Forwarding Table:

C18-1: For the forwarding of unicast packets, a switch shall implement
either a linear forwarding table or a random forwarding table, but not both.

C18-2: A switch shall implement a unicast forwarding table with at least
one entry and no more than 49,152 entries.

Two forms of a unicast forwarding table are defined: linear and random.
All switches support one and only one of these forwarding table types. In
either case, the required size for the unicast forwarding table is not
specified by IBA and may vary between implementations. However, a valid
range of table sizes is specified. Switches that implement the random
form may also choose to limit the number of entries that may be assigned
to a given port. This is further described in section 18.2.4.3 Packet Relay
on page 819.

Multicast Support: 

o18-1: The replication of multicast packets to multiple ports by switches
is optional.

o18-2: A switch that implements the switch multicast replication service
shall implement a multicast forwarding table with at least one entry and no
more than 16383 entries.

IBA defines a switch multicast service that provides for the replication of
packets by switches and their subsequent forwarding to multiple ports.
The implementation of this service is optional. If implemented, IBA does
not specify the size of the multicast forwarding table, and therefore the
number of multicast groups a switch is capable of supporting.
Consequently, the size of this table may vary by implementation. However, a
valid range of table sizes is specified. Additional multicast requirements
are specified in section 18.2.4.3.4 Optional Multicast Relay on page 825.

Virtual lanes: 

C18-3: Switches shall implement the subnet virtual lane (also referred to
as virtual lane 15).

o18-3: Switches may implement a single buffer resource shared by all
ports for the subnet management virtual lane.

o18-4: Switches that implement more than one data virtual lane shall
implement the SL to VL mapping function specified in this chapter.

All switches implement the subnet management virtual lane (which is
numbered virtual lane 15). Additionally, switches implement one, two,
four, eight, or 15 data virtual lanes. These virtual lanes are numbered
sequentially starting with zero. Unlike data virtual lanes, buffering for
virtual lane 15 may be shared by all ports and may be shared by packet
reception and transmission. This is described in 7.6 Virtual Lanes
Mechanisms on page 146.

SL to VL mapping:

o18-5: Switches that implement one data virtual lane may implement the
SL to VL mapping function specified in this chapter.

SL to VL mapping is required on switches that support more than one
virtual lane in addition to virtual lane 15. It is optional on switches that
support only one virtual lane in addition to virtual lane 15. The specific
requirements of this table are described in section 7.6.6 VL Mapping
Within a Subnet on page 152.
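
The numeric limits stated so far (C18-2, o18-2, the permitted data virtual
lane counts, and o18-4) can be collected into a small consistency check. The
following Python sketch is illustrative only: the function name, its
parameters and the message strings are assumptions made for this example and
are not defined by IBA.

```python
# Hypothetical helper: names and messages are illustrative, not part of IBA.
VALID_DATA_VL_COUNTS = {1, 2, 4, 8, 15}


def check_switch_attributes(unicast_entries, data_vls, sl2vl_implemented,
                            multicast_entries=None):
    """Return a list of violated requirements for a described switch.

    multicast_entries is None when the optional multicast replication
    service is not implemented.
    """
    problems = []
    if not 1 <= unicast_entries <= 49152:
        problems.append("C18-2: unicast table size out of range")
    if multicast_entries is not None and not 1 <= multicast_entries <= 16383:
        problems.append("o18-2: multicast table size out of range")
    if data_vls not in VALID_DATA_VL_COUNTS:
        problems.append("data VL count must be 1, 2, 4, 8 or 15")
    elif data_vls > 1 and not sl2vl_implemented:
        problems.append("o18-4: SL to VL mapping required")
    return problems
```

For example, a switch described with 1024 unicast entries, four data virtual
lanes and no SL to VL mapping would be reported as violating o18-4 only.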

P_Key Enforcement:

o18-6: Switches may implement the Inbound P_Key Enforcement Service
specified in this chapter.

o18-7: Switches may implement the Outbound P_Key Enforcement Service
specified in this chapter.

Switches may enforce partitions on ingress to and/or egress from the
switch. This mechanism is described in sections 18.2.4.2.1 Inbound
P_Key Enforcement on page 818 and 18.2.4.4.1 Outbound P_Key
Enforcement on page 826.

Maximum Transfer Unit (MTU) size:

C18-4: Switches shall be capable of forwarding packets of size from the
minimum valid packet up to 382 bytes on the management virtual lane.

C18-5: Switches shall support one of the MTU sizes specified in Table 19
Packet Size on page 161 across all ports on the switch.

C18-6: With the exception of packets arriving on the management virtual
lane, switches shall be capable of forwarding packets of size from the
minimum valid packet up to the supported MTU plus 126 bytes.

Table 19 Packet Size on page 161 specifies a choice of MTU that may be
supported by IBA devices. Switch implementations support one of the
specified MTU sizes for the entire switch. Switches are capable of
forwarding packets whose size varies up to the maximum size indicated in
the table for the implemented MTU size plus an additional 126 bytes.

Link Physicals:

IBA specifies various physical layer options. Switches may implement any
of these options on any port, and there is no requirement that all ports of
a switch implement or operate with the same physical options. Switches
conform to the detailed requirements for physical layer support as
specified in Chapter 6: Physical Layer Interface on page 130.

18.2.2 Initialization

C18-7: Upon power-up, a switch shall be initialized to the following state:

• All initialization of attributes as required in Chapter 14: Subnet
Management on page 610.

• Physical and link layers shall be reset.

• All virtual lane queues shall be cleared.

• P_Key enforcement, if implemented, shall be disabled for all ports.

• The NeighborMTU component of each PortInfo attribute shall be
initialized to indicate 256 byte MTU as specified in 14.2.5.6 PortInfo on
page 633.

Note that a switch contains many tables, some of which are optional.
These include the forwarding table, the SL to VL mapping table, the
multicast forwarding table, P_Key tables, etc. There is no requirement for a
switch to initialize any of these tables; the subnet manager is responsible
for appropriate initialization.

18.2.3 Configuration

Switches are configured via a subnet manager. Switches support the
required subnet management operations and may support the optional
subnet management operations specified in Chapter 14: Subnet
Management on page 610.



18.2.4 Packet Relay Requirements 



The primary function of IBA switches is the relay of packets between links.
This section specifies the requirements for supporting this function. This
section assumes normal operation; required operation under error
conditions is specified in section 18.2.5 Error Handling on page 827.

To simplify the explanation of switch requirements, this section is divided
into several architectural functions. This division does not imply a
particular implementation; it is done solely to enhance the organization of
the specification.

18.2.4.1 Switch Ports

C18-8: Each port on an IBA Switch except port 0 shall comply with the
physical layer requirements specified in Chapter 6: Physical Layer
Interface on page 130.

C18-9: Each port on an IBA Switch shall comply with the link layer
requirements specified in Chapter 7: Link Layer on page 134 of this
specification.

C18-10: Port number 0 shall be reserved for the forwarding of packets to
and from the switch's Subnet Management Interface and General
Services Interface.

C18-11: Port number 0 shall comply with the requirements of Chapter 9:
Transport Layer on page 196 related to unreliable datagram service.

o18-8: Port 0 shall adhere to all IBA switch port requirements specified in
this chapter with the exception that it may deviate from these
requirements in any combination of the following ways:

• Port 0 is not required to be physically instantiated.

• Port 0 is not required to implement the IB physical layer electrical,
optical, or mechanical requirements.

• Port 0 is not required to implement IB link level flow control.

C18-12: Port 0 shall assume an LMC value of 0.

C18-13: A set of the LMC component of the PortInfo attribute referencing
port 0 shall be ignored.

C18-14: All get responses of the PortInfo attribute for port 0 of a switch
(including a get response initiated in response to a set operation) shall
include a value of 0 for the LMC component.

Port 0 is assigned a LID similar to that of channel adapters; however,
unlike channel adapters, this port does not support multipathing and an LMC
value cannot be assigned. The LID is assigned using the LID component
of the PortInfo attribute. Refer to 14.2.5.6 PortInfo on page 633 for details
on these requirements.

18.2.4.2 Receiver Queuing

The receiver queueing function receives packets from the link layer
defined in Chapter 7: Link Layer on page 134.

C18-15: The virtual lane into which an individual packet is queued shall
be the one corresponding to the VL field in the packet's Local Route
Header.

C18-16: If the FilterRawInbound component of the receiving port's
PortInfo attribute is set to one, then the switch shall discard all packets
received on that port in which the LNH field of the LRH contains binary 00
or binary 01 (i.e. raw packets).

C18-17: Switches shall not discard packets in lieu of implementation of
the link level flow control as specified in section 7.9 Flow Control on page
175.

18.2.4.2.1 Inbound P_Key Enforcement

The implementation of the inbound P_Key enforcement service in
switches is optional. This section specifies the requirements of the
service if implemented.

Inbound P_Key verification is enabled and disabled for each port
individually based on the PartitionEnforcementInbound component of the
PortInfo attribute.

o18-9: If a switch provides the inbound P_Key enforcement service and
the PartitionEnforcementInbound component of the PortInfo attribute is
set to zero, then the inbound P_Key enforcement service shall be disabled
for packets received on the corresponding port.

o18-10: If a switch provides the inbound P_Key enforcement service, it
shall maintain a separate list of P_Keys associated with each port.

o18-11: If a switch provides both the inbound P_Key enforcement service
and the outbound P_Key enforcement service, then the list of P_Keys
associated with each port shall be the same list for both the inbound P_Key
enforcement service and the outbound P_Key enforcement service.

o18-12: If a switch provides the inbound P_Key enforcement service, the
P_Key table associated with each port shall be capable of containing
between one and 65535 P_Keys, inclusive (the exact number is left as an
implementation parameter).

o18-13: If a switch provides the inbound P_Key enforcement service, the
P_Key table associated with each port shall be programmable using the
P_KeyTable attribute defined in 14.2.5.7 P_KeyTable on page 642.

o18-14: If a switch provides the inbound P_Key enforcement service and
if the PartitionEnforcementInbound component of the PortInfo attribute is
set to one, then any packet received on a virtual lane other than 15 shall
either be discarded or truncated such that it contains no data past the BTH
if the value in the P_Key field in the BTH is not contained in the receiving
port's P_Key list and either of the following conditions is true:

• LNH field in the LRH contains binary 11 and IPVer field in the GRH
contains 6.

• LNH field in the LRH contains binary 10.

o18-15: If a switch provides the inbound P_Key enforcement service and
if the PartitionEnforcementInbound component of the PortInfo attribute is
set to one, then any packet received on a virtual lane other than 15 shall
either be discarded or truncated in length such that it contains no more
than 64 bytes if all of the following conditions are true:

• LNH field of the LRH contains binary 11.

• IPVer field of the GRH does not contain 6.

o18-16: If a switch provides the inbound P_Key enforcement service and
if the PartitionEnforcementInbound component of the PortInfo attribute is
set to one, then any packet that is too short to contain a BTH and whose
LNH field contains binary 11 shall be discarded or shall be forwarded with
the EBP delimiter appended and with the inverse of the valid VCRC.

Raw packets, i.e. packets in which the LNH field of the LRH contains
binary 00 or binary 01, are not subject to P_Key enforcement and are
neither discarded nor truncated by this mechanism.
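
The inbound checks in o18-14 through o18-16, together with the raw packet
exemption, can be summarized as a classification step. The Python sketch
below is a simplified illustration, not a definitive implementation: the
function name, its parameters and the returned action strings are
assumptions made for this example, and P_Key membership is modeled as plain
containment in the port's list.

```python
# Hypothetical helper: all names and action strings are illustrative.
def inbound_pkey_action(enforce, vl, lnh, ipver, pkey, port_pkeys,
                        has_bth=True):
    """Classify a received packet under inbound P_Key enforcement.

    enforce    -- value of PartitionEnforcementInbound (True means one)
    lnh        -- 2-bit LNH field of the LRH
    ipver      -- IPVer field of the GRH (ignored unless lnh == 0b11)
    port_pkeys -- collection of P_Keys programmed for the receiving port
    has_bth    -- False if the packet is too short to contain a BTH
    """
    if not enforce:
        return "pass"                          # o18-9: service disabled
    if lnh in (0b00, 0b01):
        return "pass"                          # raw packets are exempt
    if lnh == 0b11 and not has_bth:
        return "discard_or_forward_bad_vcrc"   # o18-16
    if vl == 15:
        return "pass"                          # o18-14/o18-15 exclude VL 15
    if lnh == 0b11 and ipver != 6:
        return "discard_or_truncate_to_64"     # o18-15
    # Remaining cases: lnh == 0b10, or lnh == 0b11 with IPVer == 6.
    if pkey not in port_pkeys:
        return "discard_or_truncate_after_bth"  # o18-14
    return "pass"
```

The sketch only names the permitted outcome; whether a switch discards or
truncates in each case remains an implementation choice.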

18.2.4.3 Packet Relay

Packet relay refers to the operation of transferring a packet from the
virtual lane on the inbound port to the virtual lane on an outbound port.

C18-18: A switch shall relay each unicast packet from the data virtual
lane(s) in which it was received to the output port indicated by the unicast
forwarding table entry corresponding to the packet's DLID field.

A switch performs this relay function regardless of the state of the
destination port. In certain states, the destination port will discard the
packet. This is described in detail in 7.2 Link States on page 135.

C18-19: Each packet received on virtual lane 15 in switches that
implement independent buffering for virtual lane 15 on each port shall be
relayed to the virtual lane 15 on the output port indicated by the unicast
forwarding table entry corresponding to the packet's DLID field.

C18-20: If a packet is relayed to the same port on which it was received,
it shall be discarded.

(Note: Directed route packets are permitted to be transmitted from the
port from which they were received. This does not violate the above
requirement since the packet is actually relayed from the received port to
port 0, the SMA; then it is received from port 0 and transmitted out the
original port).

C18-21: No packet contents shall be modified by the switch except as
required by this specification.

This chapter specifies various conditions under which a packet may or
must be truncated in length. These conditions do not imply that the
PktLen or PayLen fields may be modified.

C18-22: Packets received on ports other than port 0 with a DLID equal to
the permissive address shall be forwarded to port 0.

o18-17: If port 0, the SMI, or GSI of a switch does not contain sufficient
free buffering to receive the packet, the packet may be discarded.

A special address, the permissive address (see 4.1 Terminology And
Concepts on page 109), is defined by IB to permit the subnet manager to
communicate with the SMI without knowledge of the LID assigned to the SMI.
Packets with the permissive address received on ports other than 0 are
always forwarded to port 0.

C18-23: Packets with the permissive address received on port 0 (i.e.
generated by the SMI) shall be forwarded to the port specified by the SMI.

The mechanism for the SMI to specify the port is not defined by IB and
may vary by implementation.

o18-18: Switches that support more than one virtual lane in addition to the
management virtual lane (virtual lane 15) shall set the value of the VL
field in the local route header as defined in section 7.6.6 VL Mapping
Within a Subnet on page 152.

o18-19: Switches that support one virtual lane in addition to the
management virtual lane (virtual lane 15) may implement VL Mapping as
defined in section 7.6.6 VL Mapping Within a Subnet on page 152.

o18-20: Switches that support more than one virtual lane in addition to the
management virtual lane (virtual lane 15) shall relay packets to the VL of
the output port as defined in section 7.6.6 VL Mapping Within a Subnet on
page 152 if the corresponding VL is implemented on the output port.

o18-21: Switches that support more than one virtual lane in addition to the
management virtual lane (virtual lane 15) shall discard packets if the
output VL as defined in section 7.6.6 VL Mapping Within a Subnet on page
152 is not implemented on the output port.

o18-22: Switches that support only one virtual lane in addition to the
management virtual lane (virtual lane 15) shall not modify the VL field.

C18-24: Switches that support only one virtual lane in addition to the
management virtual lane (virtual lane 15) shall relay packets to the VL of
the output port indicated by the VL field in the LRH.

For switches that support only one data virtual lane, the link layer will
discard all packets that do not contain either 0 or 15 in the VL field;
therefore, there is no need for the relay function to modify the VL field in
this case.



C18-25: Except for virtual lane 15, if the virtual lane on the outbound port
does not contain sufficient space for the packet to be relayed, then the
packet shall remain in the virtual lane on the inbound port until sufficient
space is available or until the switch lifetime limit mechanism permits the
discard of the packet.

C18-26: If the relay function is unable to relay a packet from an inbound
port to an outbound port due to lack of sufficient space in the outbound VL,
the relay function shall continue to relay packets from other virtual lanes
destined for virtual lanes on outbound ports with sufficient space.

o18-23: In switches that implement independent buffering on each port for
virtual lane 15, if, when relaying virtual lane 15 packets, the virtual lane
on the output port does not contain sufficient space for the packet to be
relayed, then the packet may be discarded.

C18-27: Packets shall be transmitted on a given port and SL in the same
order as they were received from a given port, except that ordering
between unicast and multicast packets is not required.

o18-24: The relay function may, but is not required to, relay packets in the
inbound portion of virtual lanes that are behind packets that are blocked
due to insufficient space in the outbound portion of virtual lanes.

The method of arbitration when multiple inbound VLs have packets
destined for the same outbound VL is left to the implementor, but the
arbitration should service all inbound ports fairly.

C18-28: The forwarding table shall be configured in one of two ways,
linear or random, as defined in section 18.2.4.3.1 Linear Forwarding Table
Requirements on page 822 and 18.2.4.3.2 Random Forwarding Table
Requirements on page 823.

C18-29: Switches shall conform to the requirements in section 18.2.4.3.3
Required Multicast Relay on page 824.

o18-25: Switches may implement the requirements in section 18.2.4.3.4
Optional Multicast Relay on page 825.

C18-30: A switch that does not implement the optional multicast relay
shall set the MulticastFDBCap component of the SwitchInfo attribute to
zero.

6 

18.2.4.3.1 Linear Forwarding Table Requirements

This section describes the requirements related to the linear forwarding
table. The linear forwarding table provides a simple map from LID to
destination port. Conceptually, the table itself contains only destination
ports; the LID acts as an index into the table from which the packet's
destination address is obtained.

C18-31: In switches that implement the linear forwarding table, the linear
forwarding table shall contain a port entry for each LID starting from zero
and incrementing by one up to the size of the forwarding table.

C18-32: In switches that support the linear forwarding table, the size of the
linear forwarding table shall be advertised in the LinearFDBCap
component of the SwitchInfo attribute.

C18-33: In switches that support the linear forwarding table, the
RandomFDBCap component of the SwitchInfo attribute shall be set to zero.

C18-34: In switches that implement the linear forwarding table, the linear
forwarding table shall be programmable using the LinearForwardingTable
attribute as described in 14.2.5.10 LinearForwardingTable on page 645.

Note that forwarding to the SMI/GSI is enabled by programming the
corresponding entries in the forwarding table to port 0. Setting the LID
component of the PortInfo attribute does not automatically load this value
in the forwarding table.

C18-35: A switch that implements a linear forwarding table shall support
the SM programmable LinearFDBTop component of the SwitchInfo
attribute as described in 14.2.5.4 SwitchInfo on page 632.

C18-36: Switches that implement linear forwarding tables shall discard all
unicast packets that meet any of the following conditions:

• the packet's DLID value is greater than the value of LinearFDBTop
and is not the permissive address

• the packet's DLID is outside the range supported by the linear
forwarding table and is not the permissive address

• the port number in the forwarding table corresponding to the packet's
DLID is set to a port that does not exist.
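
The linear lookup, the discard conditions of C18-36, the same-port rule of
C18-20 and the permissive address rules of C18-22 and C18-23 can be
sketched together as follows. This is a minimal Python illustration under
stated assumptions: the function and parameter names are hypothetical, the
table is modeled as a list indexed by LID, `valid_ports` stands for the set
of port numbers that exist on the switch, and `smi_port` stands for the
implementation-specific port chosen by the SMI. The value 0xFFFF is the
permissive LID defined by IBA.

```python
PERMISSIVE_LID = 0xFFFF  # permissive address defined by IBA


def linear_lookup(table, linear_fdb_top, valid_ports, dlid, in_port,
                  smi_port=None):
    """Return the output port for a unicast packet, or None to discard.

    Hypothetical helper: table[lid] holds the destination port for lid.
    """
    if dlid == PERMISSIVE_LID:
        # C18-22: forward to port 0; C18-23: from port 0, the SMI chooses.
        return 0 if in_port != 0 else smi_port
    if dlid > linear_fdb_top:
        return None              # C18-36, first condition
    if dlid >= len(table):
        return None              # C18-36, outside the supported range
    out_port = table[dlid]
    if out_port not in valid_ports:
        return None              # C18-36, port does not exist
    if out_port == in_port:
        return None              # C18-20, same port as received
    return out_port
```

With `table = [0, 1, 2, 7]`, `linear_fdb_top = 3` and ports 0 through 2
present, a packet for LID 2 arriving on port 1 is relayed to port 2, while a
packet for LID 3 is discarded because port 7 does not exist.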

18.2.4.3.2 Random Forwarding Table Requirements

This section describes the requirements related to the random forwarding
table. Conceptually, the random forwarding table acts as a "content
addressable memory"; it is loaded with both LIDs and destination ports. The
table is "addressed" by a packet's LID and the corresponding destination
port is returned. A switch implementation can limit the number of LIDs that
correspond to a given port to as few as one. This enables the
implementation of a "leaf" switch, i.e., a switch that supports only the
connection of CAs to all ports but one. Such a switch requires a very small
forwarding table (one LID per port). Such limitations are neither mandated
nor prohibited by this specification.

C18-37: In switches that implement the random forwarding table, the
random forwarding table shall provide for the storage of a set of unicast
LID/LMC pairs and corresponding destination port entries.

C18-38: Switches that implement the random forwarding table shall
maintain a DefaultPort value which shall be programmable via the DefaultPort
component of the SwitchInfo attribute (see 14.2.5.4 SwitchInfo on page
632 for additional detail).

C18-39: Packets that arrive on ports other than the port indicated by
DefaultPort with a unicast DLID field that does not match an entry in the
random forwarding table and is not equal to the permissive address shall
be forwarded to the port indicated by DefaultPort.

C18-40: If the DefaultPort value is a port that does not exist, then packets
that would otherwise be forwarded to this port shall be discarded.

C18-41: Packets that arrive on the port indicated by DefaultPort with a
unicast DLID field that is not the permissive address and does not match an
entry in the random forwarding table shall be discarded.

Matching an entry in the table means that the packet's DLID matches the
LID in the table excluding the LMC least significant bits.
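
The matching rule and the DefaultPort behavior of C18-39 through C18-41 can
be illustrated with a short Python sketch. It is an illustration under
assumptions rather than a prescribed design: the function name, the
representation of the table as a list of (LID, LMC, port) tuples and the
`valid_ports` set are hypothetical, and permissive address handling is
assumed to be done by the caller before this lookup.

```python
def random_lookup(entries, dlid, in_port, default_port, valid_ports):
    """Return the output port for a unicast DLID, or None to discard.

    Hypothetical helper: entries is a list of (lid, lmc, port) tuples.
    A DLID matches an entry when the two values agree once the LMC least
    significant bits are excluded.
    """
    for lid, lmc, port in entries:
        if (dlid >> lmc) == (lid >> lmc):
            # C18-20: never relay back out of the receiving port.
            return None if port == in_port else port
    if in_port == default_port:
        return None          # C18-41: unmatched, arrived on DefaultPort
    if default_port not in valid_ports:
        return None          # C18-40: DefaultPort does not exist
    return default_port      # C18-39: unmatched, send to DefaultPort
```

For instance, an entry (0x10, 2, 3) covers DLIDs 0x10 through 0x13, so a
packet with DLID 0x13 matches it, while DLID 0x14 falls through to the
DefaultPort rules.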



C18-42: Switches that implement the random forwarding table shall
advertise the size of the table, i.e. the number of LID/LMC pairs that it may
contain, in the RandomFDBCap component of the SwitchInfo attribute.

C18-43: Switches that implement the random forwarding table shall set
the LinearFDBCap component of the SwitchInfo attribute to zero.

o18-26: Switches that implement a random forwarding table may limit the
number of LID/LMC pairs that can be assigned to a given port.

C18-44: If a switch that implements the random forwarding table limits the
number of LID/LMC pairs that can be assigned to a given port, then it shall
set the LIDsPerPort component of the SwitchInfo attribute to the
number of LIDs that is supported per port.

C18-45: If a switch that implements the random forwarding table does not
impose such a limitation on the number of LID/LMC pairs that can be
assigned to a given port, it shall set the value of the LIDsPerPort component
the same as the RandomFDBCap component.

The LIDsPerPort component does not apply to port 0.

C18-46: In switches that support the random forwarding table, the random
forwarding table shall support exactly one LID/LMC entry.

18.2.4.3.3 Required Multicast Relay

C18-47: All switches shall maintain values for a default primary multicast
port and a default non-primary multicast port.

All switches maintain values for a default primary multicast port and a
default non-primary multicast port regardless of whether the switch supports
multicast forwarding and regardless of the type of unicast forwarding table
implemented.

C18-48: Switches shall allow the SM to set the values of the default
primary multicast port and the default non-primary multicast port using the
DefaultMulticastPrimaryPort and DefaultMulticastNotPrimaryPort
components of the SwitchInfo attribute.

C18-49: All multicast packets that are received on ports other than the
default multicast primary port shall be forwarded to the default multicast
primary port if any of the following conditions are true:

• The switch does not implement a multicast forwarding table.

• The switch implements a multicast forwarding table and the multicast
DLID in the packet is outside the range of the multicast forwarding
table.

• The switch implements a multicast forwarding table and the entry in
the forwarding table corresponding to the packet's DLID is zero.

C18-50: All multicast packets that are received on the default multicast
primary port shall be forwarded to the default multicast non-primary port if
any of the following conditions are true:

• The switch does not implement a multicast forwarding table.

• The switch implements a multicast forwarding table and the multicast
DLID in the packet is outside the range of the multicast forwarding
table.

• The switch implements a multicast forwarding table and the entry in
the forwarding table corresponding to the packet's DLID is zero.

C18-51: If either the default multicast primary port or default multicast
non-primary port is set to a port that does not exist, then multicast packets
that would otherwise be forwarded to the corresponding port shall be
discarded.
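
The default-port rules of C18-49 through C18-51 reduce to a short decision
procedure, sketched below in Python. The sketch rests on assumptions made
only for illustration: the function name, the tuple return values, the
`valid_ports` set, and the modeling of the multicast forwarding table as a
list of integer entries indexed from the first multicast LID (0xC000, see
section 18.2.4.3.4), with `None` meaning that no table is implemented.

```python
MULTICAST_LID_BASE = 0xC000  # first multicast LID


def default_multicast_relay(dlid, in_port, primary, non_primary,
                            valid_ports, mft=None):
    """Decide how a multicast packet is handled (hypothetical helper).

    Returns ("table", None) when a non-zero table entry governs relay,
    ("forward", port) when a default port applies, or ("discard", None).
    """
    index = dlid - MULTICAST_LID_BASE
    has_entry = (mft is not None and 0 <= index < len(mft)
                 and mft[index] != 0)
    if has_entry:
        return ("table", None)
    # C18-49 / C18-50: choose the default port by the receiving port.
    out_port = non_primary if in_port == primary else primary
    if out_port not in valid_ports:
        return ("discard", None)     # C18-51
    return ("forward", out_port)
```

A switch without a multicast forwarding table therefore sends every
multicast packet toward the primary port, except packets arriving on the
primary port itself, which go to the non-primary port.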

18.2.4.3.4 Optional Multicast Relay

This section describes the requirements for the optional replication of
multicast packets.

o18-27: The replication of packets as part of multicast relay is optional.

o18-28: Switches that support multicast packet replication shall
implement a multicast forwarding table that contains a port entry for each
multicast LID starting from 0xC000 and sequentially incrementing to include
the total number of multicast entries supported.

o18-29: In switches that support multicast packet replication, the number
of multicast entries supported in the multicast forwarding table shall be at
least one and no greater than 16383.

o18-30: In switches that support multicast packet replication, the number
of multicast entries supported in the multicast forwarding table shall be
advertised in the MulticastFDBCap component of the SwitchInfo attribute.

o18-31: In switches that support multicast packet replication, if the DLID
of a packet is a multicast LID, then the switch shall relay the packet to the
set of ports, excluding the port on which the packet was received,
indicated by the multicast forwarding table entry corresponding to the
packet's DLID field.

o18-32: In switches that support multicast packet replication, the virtual
lane field shall be updated in each replicated packet in the same manner
as for unicast packets.
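
The replication step in o18-31 can be sketched as follows. The Python
fragment is illustrative and makes two assumptions that the text above does
not state: each multicast forwarding table entry is modeled as an integer
bit mask in which bit n selects port n, and the table is a list whose first
element corresponds to LID 0xC000. The function name is hypothetical.

```python
MULTICAST_LID_BASE = 0xC000  # first multicast LID, per o18-28


def replication_ports(mft, dlid, in_port):
    """Return the ports that receive a copy of a multicast packet.

    Hypothetical helper: mft[i] is assumed to be a port bit mask for
    multicast LID 0xC000 + i. The receiving port is always excluded.
    """
    mask = mft[dlid - MULTICAST_LID_BASE]
    return [port for port in range(mask.bit_length())
            if (mask >> port) & 1 and port != in_port]
```

With an entry of 0b1110 for LID 0xC000, a packet received on port 2 is
replicated to ports 1 and 3.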

18.2.4.4 Transmitter Queuing

Relayed packets are queued in the outbound portion of virtual lanes.

18.2.4.4.1 Outbound P_Key Enforcement

The implementation of the outbound P_Key enforcement service in 
switches is optional. This section specifies the requirements of the ser- ^ 
vice if implemented. 1 0 

11 

Outbound P_Key verification shall be enabled and disabled for each port 1 2 
individually based on the PatrtitionEnforcementOutbound component of ^3 
the Portlnfo attribute. 

14 

018-33: If a switch provides the outbound P_Key enforcement service ^ ^ 
and the PartitionEnfocementOutbound component of the Portlnfo At- 1 6 
tribute is set to zero, then the outbound P_key enforcement service shall 17 
be disabled for packets received on the corresponding port. 18 

19 
20 
21 

01 8-35: If a switch provides both the inbound P_Key enforcement service 22 
and the outbound P_Key enforcement service, then the list of P_Keys as- 23 
sociated with each port shall be the same list for both the inbound P_Key 24 
enforcement service and the outbound P_Key enforcement service. 25 



o18-34: If a switch provides the outbound P_Key enforcement service, it 
shall maintain a separate list of P_Keys associated with each port. 



26 
27 



018-36: If a switch provides the outbound P_Key enforcement service, 
the P„Key table associated with each port shall be capable of containing 
between one and 65535 P_Keys, inclusive (the exact number is left as an 
implementation parameter). 29 

30 

o18-37; If a switch provides the outbound P_Key enforcement service, 31 
the P_Key table associated with each port shall be programmable using 32 
the P_KeyTable attribute defined in 14.2.5.7 P KevTable on pace 642 . 

018-38: If a switch provides the outbound P_Key enforcement service 
and if the PartitionEnfocementOutbound component of the Portlnfo At- 35 
tribute is set to one, then any packet to be transmitted on a virtual lane 36 
other than 1 5 on that port shall either be discarded or truncated such that 37 
it contains no data past the BTH if the value in the P„Key field in the BTH 33 
is not contained in the transmitting port's P_Key list and either of the fol- 3g 
lowing conditions are true: 

41 
42 
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18.2.4.5 Packet Transmission 



18.2.5 Error Handling 



• LNH field in the LRH contains binary 11 and IPVer field in the GRH 
contains 6. 

• LNH field in the LRH contains binary 10. 

018-39: If a switch provides the outbound P_Key enforcement service 
and if the PartitionEnforcementOutbound component of the Portlnfo At- 
tribute is set to one, then any packet to be transmitted on a virtual lane 
other than 15 of that port shall either be discarded or truncated in length 
such that it contains no more than 64 bytes if all of the following conditions 
are true: 

• LNH field of the LRH contains binary 1 1 . 

• IPVer field of the GRH does not contain 6. 

O18-40: If a switch provides the outbound P_Key enforcement service 
and if the PartitionEnforcementOutbound component of the Portlnfo At- 
tribute is set to one, then any packet that is too short to contain a BTH and 
that the LNH field contains binary 11 shall be discarded or shall be for- 
warded with the EBP delimiter appended and with the inverse of the valid 
VCRC. 

Raw packets, i.e. packets in which the LNH field of the LRH contains bi- 
nary 00 or binary 01 , are not subject to P_Key enforcement and are not 
discarded nor truncated by this mechanism 



C18-52: If the FilterRawOutbound component of the transmitting port's
PortInfo attribute is set to one, then the switch shall discard all packets to
be transmitted on that port in which the LNH field of the LRH contains
binary 00 or binary 01 (i.e. raw packets).

C18-53: Each packet shall be transmitted with a valid VCRC field
computed as specified in section 7.8.2 Variant CRC (VCRC) - 2 Bytes on page
163, unless required otherwise in this chapter or in Chapter 7:
Link Layer on page 134.

C18-54: Each packet shall be transmitted with an EGP character
appended unless required otherwise in this chapter.

C18-55: Switches shall support the requirements specified in 7.6.9 VL
Arbitration and Prioritization on page 154.



This section specifies required operation under error conditions. Like the
previous section, this section is divided into several architectural
functions. This division does not imply a particular implementation; it is done
solely to enhance the organization of the specification.
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18.2.5.1 Switch Ports 1 

C18-56: Each port except port 0 on an IBA Switch shall comply with the
physical layer error requirements specified in Chapter 6: Physical Layer
Interface on page 130.

C18-57: Each port except port 0 on an IBA Switch shall comply with the
link layer error requirements specified in Chapter 7: Link Layer on page
134 of this specification.

18.2.5.2 Receiver Queuing

There are no additional receiver queuing error handling requirements.

18.2.5.3 Packet Relay

There are no additional packet relay error handling requirements.


18.2.5.4 Transmitter Queueing

The transmitter packet discard is based on, among other things, two time
values: Switch Lifetime Limit (SLL) and Head of Queue Lifetime Limit
(HLL).

HLL is defined as 4.096 µs * 2^HL if 0 ≤ HL ≤ 19, +5% / -55%. HL is the
HOQLife component of the PortInfo attribute. If HL > 19, then HLL is to be
interpreted as infinite.
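
As a worked illustration of the lifetime formula, the sketch below computes the nominal limit from the 4.096 µs base and a power-of-two exponent. The function name `lifetime_limit_us` is hypothetical. It assumes that values 0 through 19 inclusive are finite, and it ignores the +5% / -55% tolerance. The same helper would apply to SLL with the LifeTimeValue component, since both limits share the formula.

```python
import math

BASE_US = 4.096  # base time unit in microseconds, taken from the formula


def lifetime_limit_us(value: int) -> float:
    """Nominal lifetime limit in microseconds for a HOQLife or LifeTimeValue.

    Values above 19 are interpreted as an infinite limit.
    """
    if value < 0:
        raise ValueError("component value must be non-negative")
    if value > 19:
        return math.inf
    return BASE_US * (2 ** value)
```

For example, a value of 0 gives 4.096 µs, and a value of 19 gives 4.096 µs * 524288, roughly 2.147 seconds.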


C18-58: The transmitter queueing function shall discard any packet that
meets any of the following conditions:

• The packet has been at the head of the Virtual Lane (i.e. the position
to be transmitted next), and has not begun transmission within HLL.

• The packet is queued to a VL that is in the VL stalled state. If
VLStallCount sequential packets are discarded from a given VL due to
exceeding the HLL requirement above, the VL shall enter the VL
stalled state. A VL shall leave the VL stalled state 8 * HLL after
entering it. The VLStallCount component is provided in the PortInfo attribute.

• The size of the packet as indicated by the PktLen field exceeds the
MTU supported by the neighbor device as indicated by the
NeighborMTU component of the PortInfo attribute.
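
The VL stalled state described in the second condition behaves like a small timer-driven state machine. The sketch below is one possible reading, not the specified design: the class `VLStallTracker` and its methods are hypothetical, time is passed in explicitly in microseconds, and the counter of sequential discards is assumed to reset whenever a packet is transmitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLStallTracker:
    hll_us: float                 # Head of Queue Lifetime Limit, microseconds
    vl_stall_count: int           # VLStallCount component of PortInfo
    sequential_drops: int = 0     # consecutive HLL discards seen so far
    stalled_since_us: Optional[float] = None

    def on_hll_discard(self, now_us: float) -> None:
        """Record a packet discarded for exceeding HLL at the head of the VL."""
        self.sequential_drops += 1
        if self.sequential_drops >= self.vl_stall_count:
            # Enter the VL stalled state.
            self.stalled_since_us = now_us
            self.sequential_drops = 0

    def on_transmit(self) -> None:
        """A packet left the VL, so the discards are no longer sequential."""
        self.sequential_drops = 0

    def is_stalled(self, now_us: float) -> bool:
        """True while the VL is stalled; the state ends 8 * HLL after entry."""
        if (self.stalled_since_us is not None
                and now_us - self.stalled_since_us >= 8 * self.hll_us):
            self.stalled_since_us = None
        return self.stalled_since_us is not None
```

While `is_stalled` returns true, the transmitter would discard every packet queued to that VL, per the condition above.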

SLL is defined as 4.096 µs * 2^LV if 0 ≤ LV ≤ 19, +5% / -55%. LV is the
LifeTimeValue component of the SwitchInfo attribute. If LV > 19, then SLL is
to be interpreted as infinite.

C18-59: If a switch by virtue of its implementation cannot guarantee that
any packet entering it will be transmitted within 2.5 ms, measured first bit
in to first bit out and assuming flow control credit is continuously available,
then it shall discard any packet that has not begun transmission within
SLL measured from the time the first bit was received by the switch.

o18-41: If a switch by virtue of its implementation can guarantee that any
packet entering it will be transmitted within 2.5 ms, measured first bit in to
first bit out and assuming flow control credit is continuously available, then
it may discard any packet that has not begun transmission within SLL
measured from the time the first bit was received by the switch.

18.2.5.5 Packet Transmission


C18-60: Each packet to be transmitted that is truncated in length as
permitted or specified by any condition in this chapter shall be corrupted as
specified in 7.3 Packet Receiver States on page 138.



18.2.6 Subnet Management Agent Requirements 



C18-61: Switches shall support a Subnet Management Interface (SMI) as
specified in Chapter 14: Subnet Management on page 610.

C18-62: Switches shall support a General Services Interface (GSI) as
specified by Chapter 16: General Services on page 717.

There are various mandatory and optional requirements of these
interfaces that are specified in the respective chapters.

C18-63: Switches shall implement P_Key checking on the GSI as
specified in section 10.9.8 Partition Enforcement on Management Queue Pairs
on page 433.
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Chapter 19: Routers 



19.1 Overview 



IBA Routers are IBA packet relay devices that operate at the network
layer of the IBA addressing hierarchy to interconnect multiple locally
addressed subnets. As the top level in the hierarchy, IBA Routers rely on
global identifiers (GIDs).

[Figure 193 Reference of Routers Connecting Subnets: routers provide connectivity among subnets]



IBA Router usage is meant to satisfy: 

1) Scalability 

2) Local address space reuse 

3) Containment of failures and topology changes 

4) Confinement of fabric management scope to subnets

In fulfilling these objectives, IBA Routers also allow the IBA semantics
and QoS characteristics to be extended across IBA subnets.

IBA Routers are required to support unicast routing and may support
multicast routing. The specification of the routing forwarding mechanisms in
this chapter is presently limited to unicast routing.

IBA Routers use destination based routing, where every destination port
within the global fabric is assigned one or more unique Global Identifiers
(GID). From the point of view of a Router, a GID represents either an
endnode port or another router's port on a directly attached subnet. A GID
does not necessarily represent a path through the fabric, as the Router is
allowed to spread traffic over several paths based on other packet header
criteria.


A Router is visible to IBA nodes on the directly attached subnets, and it is
transparent to nodes on any remote subnet. The Subnet Manager
address resolution function makes local routers visible to endnodes;
endnodes in turn use this information when addressing packets to a local router
LID on their way to a remote destination. Routers on the same subnet are
also visible to each other, both for the purpose of implementing a routing
protocol, and also when routing packets through other routers as the next
hop. Finally, routers on a subnet are also visible to Subnet Managers on
their respective directly attached subnets.

Each Router port must support a Subnet Management Interface (SMI)
(see 13.5.1.1 Processing Subnet Management Packets (SMPs) on page
601) and a General Services Interface (GSI) (see 13.5.1.2 Processing
General Services Management Packets (GMPs) on page 602). There are
various mandatory and optional requirements of this interface that are
specified in the management chapter. Subnet Managers assign LIDs to
IBA Router ports and provide a service to find a path to other endnodes
or routers. The Router Management section of the Management chapter
defines the attributes of the interactions between Subnet Managers and
Routers.


19.2 Detailed functional requirements

The present IBA Router specification does not cover the routing protocol
nor the messages exchanged between routers. Future revisions of this
chapter will complete such control functions.

IBA Routers reside at the boundaries between subnets, and are
configured separately per port by different Subnet Managers and at different
times. Subnet managers supply IBA Routers with LIDs/LMCs (for each
port separately), and additional path information like SL to VL mappings
and MTU values.


Unicast Routing Table:

C19-1: A router shall implement a unicast routing table with at least as
many entries as the number of router ports.

IBA Routers have routing tables for their active routes. These tables are
hierarchical and include explicit endnode routes (e.g. last hop to
endnode), prefix routes (aggregate route for entire subnet), and possibly
default routes (routes for unknown prefixes). The size of the unicast routing
table is implementation dependent.

Virtual lanes:

C19-2: Routers shall implement the subnet management virtual lane (also
referred to as virtual lane 15).

C19-3: Router ports shall implement data virtual lanes as specified in 7.6
Virtual Lanes Mechanisms on page 146, in addition to the subnet
management virtual lane, numbered virtual lane 15.

C19-4: Virtual lane 15 shall be implemented independently for each router
port.

C19-5: Routers shall not route any VL15 packets between router ports.

C19-7: Routers shall preserve the Tclass value when routing.



SL to VL mapping:

C19-6: Routers that implement more than one data virtual lane shall
implement the SL to VL mapping function specified in this chapter.

o19-1: Routers that implement one data virtual lane may implement the
SL to VL mapping function specified in this chapter.

Per port SL to VL mapping is required on Routers that support more than
one virtual lane in addition to virtual lane 15. It is optional on Routers that
support only one virtual lane in addition to virtual lane 15.

Tclass to SL mapping:

SL values may be replaced when routing into a different subnet. The
Tclass value is preserved, as it represents the Class of Service with
end-to-end IBA scope. The Tclass to SL mapping function is not defined by the
present specification revision.
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19.2.2 Initialization 



P_Key Enforcement: 

o19-2: Routers may implement the Inbound P_Key Enforcement Service 
specified in this chapter. 

o19-3: Routers may implement the Outbound P_Key Enforcement Ser- 
vice specified in this chapter. 

Routers may enforce partitions on ingress to and/or egress from the 
Router. 

Maximum Transfer Unit (MTU) size: 

C19-8: Each router port shall independently support one of the MTU sizes 
specified in Table 19 Packet Size on page 161 . 

C19-9: Routers shall be capable of routing packets of size from the min- 
imum valid packet size up to the supported MTU of the intervening ports 
plus 126 bytes. 

Table 19 Packet Size on page 161 specifies a choice of MTU that may be 
supported by IBA devices. Router implementations independently support 
one of the specified MTU sizes for each port. Packets exceeding the MTU 
size of the participating links may be discarded or truncated. Each port 
provides sufficient buffering for each data VL to advertise credit for at least 
one packet with MTU payload.

Link Physicals: 

IBA specifies various physical layer options. Routers may implement any
of these options on any port and there is no requirement that all ports of a
Router implement or operate with the same physical options. Routers
shall conform to the detailed requirements for physical layer support as
specified in Chapter 6: Physical Layer Interface on page 130.

End-to-end data integrity: 

Although IBA routers modify some packet headers during routing, none of 
these headers affects the value of the ICRC, and IBA Routers shall pre- 
serve the original ICRC rather than recomputing its value locally. 



C19-10: Upon power-up, a router shall be initialized to the following state:

• All initialization defined in Chapter 13: Management Model on page
564 and applicable to routers.

• Physical and link layers reset.

• All virtual lane queues shall be cleared.

Other routing table entries and all queues (and any other table type
structures) shall be cleared.

19.2.3 Configuration

Routing tables - Entries may be derived from any combination of
externally configured routes and autonomously computed routes. In particular,
routers rely on the SM database for routes to endnodes on directly
attached subnets.

SL mapping tables - SL to VL mapping tables exist at every router port.

Tclass mapping - Any necessary configuration of the Tclass end-to-end
class of service role in determining the local SL value is configured into
the router.

19.2.4 Packet Relay Model

The logical abstraction for IBA packet routing is that of packet by packet
routing and is given by:

if
    i) ((BASE DLID == router port BASE LID) AND
    ii) (LRH:Next Header == GRH) AND
    iii) (Destination GID <> router GID) AND
    iv) (Destination GID matches entry in route table) AND
    v) (VCRC OK) AND (Hop count > 1))
then {
    i) Replace DLID with value from routing table
    ii) Replace SLID with LID of output port
    iii) Replace SL, considering Tclass (among other possible criteria)
    iv) Map SL to VL using per output port table
    v) Decrement Hop count
    vi) Recompute VCRC, preserve ICRC
}

Note: Routers may also check ICRC, or just rely on the end-to-end protection
of endnodes checking ICRC.
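
The relay abstraction above can be sketched in executable form. This is an illustration under stated assumptions rather than the router's actual design: the types `Packet`, `Port` and `Route` and the function `relay` are hypothetical, the routing table is reduced to an exact-match dictionary keyed by destination GID, the base DLID is derived by masking the low LMC bits, and the Tclass to SL policy is supplied by the caller because the text does not define it. VCRC recomputation is only noted in a comment.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Optional, Tuple

LNH_GLOBAL = 0b11  # LNH value indicating that a GRH follows the LRH


@dataclass(frozen=True)
class Packet:
    dlid: int
    slid: int
    sl: int
    vl: int
    lnh: int
    dgid: str
    tclass: int
    hop_count: int
    vcrc_ok: bool


@dataclass
class Port:
    base_lid: int
    lmc: int
    gid: str
    sl_to_vl: Dict[int, int]  # per output port SL to VL table


@dataclass(frozen=True)
class Route:
    next_dlid: int
    out_port: int


def relay(pkt: Packet, in_port: Port, ports: Dict[int, Port],
          routes: Dict[str, Route],
          tclass_to_sl: Callable[[int], int]) -> Optional[Tuple[Packet, int]]:
    """Return (rewritten packet, output port number), or None if not relayed."""
    base_dlid = pkt.dlid & ~((1 << in_port.lmc) - 1)
    router_gids = {p.gid for p in ports.values()}
    route = routes.get(pkt.dgid)
    accepted = (base_dlid == in_port.base_lid          # condition i
                and pkt.lnh == LNH_GLOBAL              # condition ii
                and pkt.dgid not in router_gids        # condition iii
                and route is not None                  # condition iv
                and pkt.vcrc_ok and pkt.hop_count > 1)  # condition v
    if not accepted:
        return None
    out_port = ports[route.out_port]
    new_sl = tclass_to_sl(pkt.tclass)
    new_vl = out_port.sl_to_vl.get(new_sl)
    if new_vl is None:
        return None  # no configured VL for this SL on the output port
    # The VCRC would be recomputed here; the ICRC is left untouched.
    relayed = replace(pkt, dlid=route.next_dlid, slid=out_port.base_lid,
                      sl=new_sl, vl=new_vl, hop_count=pkt.hop_count - 1)
    return relayed, route.out_port
```

Using an immutable packet record makes the header rewrites of the `then` branch explicit: every field that changes appears in the single `replace` call.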



19.2.4.1 Path Selection 



19.2.4.2 Router Ports 



The above abstraction, combined with the Network layer Addressing 
model, dictates a longest match against the destination GID. Implementa- 
tions may exploit the addressing model to relax this function to a combi- 
nation of a 64-bit longest match for prefix type entries, and either a 64-bit 
or 128-bit fixed length match for explicit routes, depending on the unique- 
ness scope of the lowest 64 bits of the GID. 
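
A longest match against the destination GID can be illustrated with Python's standard `ipaddress` module, assuming GIDs are written in IPv6 address notation. The function `lookup` and the table contents are hypothetical, and a linear scan stands in for whatever lookup structure an implementation would really use.

```python
import ipaddress
from typing import List, Optional, Tuple


def lookup(dgid: str, table: List[Tuple[str, str]]) -> Optional[str]:
    """Return the next hop of the longest matching prefix, or None.

    table holds (prefix, next_hop) pairs, e.g. a /64 prefix route,
    a /128 explicit endnode route, or a ::/0 default route.
    """
    addr = ipaddress.IPv6Address(dgid)
    best_len = -1
    best_hop = None
    for prefix, next_hop in table:
        net = ipaddress.IPv6Network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len = net.prefixlen
            best_hop = next_hop
    return best_hop


# Hypothetical table with one explicit route, one prefix route, one default.
EXAMPLE_TABLE = [
    ("fec0:0:0:1::/64", "port 1"),
    ("fec0:0:0:1::abcd/128", "port 2"),
    ("::/0", "default"),
]
```

With this table, an explicit /128 entry wins over the /64 prefix that also covers it, and any GID outside the known prefix falls through to the default route.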



A Router may support multiple paths to a given DGID. These may include 
paths via the same next-hop and/or different next-hops as well as different 
paths within the subnet (LMC based) to a given GID. 

An IBA Router may actively use multiple paths with equal or different
costs, as long as it does not affect ordering by separating packets of a
given session. To allow IBA Routers to have different degrees of
sophistication in determining what packets may be separated, a session is used
in a deliberately vague way.

The baseline assumption is that endpoints will use identical
GRH:FlowLabel values for sequences of packets whose relative ordering is
important; therefore a possible session representation at the router would be
the (DGID, SGID, Tclass, SL, FlowLabel) tuple. A router may use other
attributes to select a path, but once selected that path will continue to be
used for subsequent packets unless a management event dictates a new
path.
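
One way to keep all packets of a session on the same path is to derive the path from a hash of the session tuple. The sketch below shows that idea only; it is not a mechanism the text prescribes. The function `select_path` is hypothetical, and the choice stays stable only while the list of candidate paths is unchanged, which loosely corresponds to the absence of a management event.

```python
import hashlib
from typing import Sequence


def select_path(dgid: str, sgid: str, tclass: int, sl: int,
                flow_label: int, paths: Sequence[str]) -> str:
    """Pick one of the candidate paths deterministically from the session tuple."""
    if not paths:
        raise ValueError("at least one candidate path is required")
    key = f"{dgid}|{sgid}|{tclass}|{sl}|{flow_label}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]
```

Because the selection is stateless, the router needs no per-session table; packets with the same tuple always map to the same entry, while different sessions spread across the available paths.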



A router with N physical ports associates PortInfo attributes to its
physical ports based on the Port Number Attribute Modifier value, ranging
between 1 and N.

C19-11: Each port on an IBA Router shall comply with the physical layer
requirements specified in Chapter 6: Physical Layer Interface on page
130.

C19-12: Each port on an IBA Router shall comply with the link layer
requirements specified in Chapter 7: Link Layer on page 134 of this
specification.

C19-13: Each port on an IBA Router shall implement the Subnet
Management PortInfo Attribute specified in 14.2.5.6 PortInfo on page 633.

C19-14: Each port on an IBA Router shall have at least one GID assigned
to it.

19.2.4.3 Receiver Queuing

The receiver queueing function receives packets from the link layer
defined in Chapter 7: Link Layer on page 134.

C19-15: The virtual lane into which an individual packet is queued shall
be the virtual lane whose virtual lane number matches the VL field in the
packet's Local Route Header.

C19-16: If the FilterRawInbound component of the receiving port's
PortInfo attribute is set to one, then the Router shall discard all packets
received on that port in which the LNH field of the LRH contains a binary 00
or binary 01 (i.e. raw packets).

19.2.4.3.1 Inbound P_Key Enforcement

The implementation of the inbound P_Key enforcement in Routers is
optional. This section defines its requirements if implemented.

Inbound P_Key verification shall be enabled and disabled for each port
individually based on the PartitionEnforcementInbound component of the
PortInfo attribute.

o19-4: If a router provides the inbound P_Key enforcement service and
the PartitionEnforcementInbound component of the PortInfo Attribute is
set to zero, then the inbound P_Key enforcement service shall be disabled
for packets received on the corresponding port.

o19-5: If a router provides the inbound P_Key enforcement service, it
shall maintain a separate list of P_Keys associated with each port.

o19-6: If a router provides both the inbound P_Key enforcement service
and the outbound P_Key enforcement service, then the list of P_Keys
associated with each port shall be the same list for both the inbound P_Key
enforcement service and the outbound P_Key enforcement service.

o19-7: If a router provides the inbound P_Key enforcement service, the
P_Key table associated with each port shall be capable of containing
between 1 and 65535 P_Keys, inclusive (the exact number is left as an
implementation parameter).

o19-8: If a router provides the inbound P_Key enforcement service, the
P_Key table associated with each port shall be programmable using the
P_KeyTable attribute defined in 14.2.5.7 P_KeyTable on page 642.

o19-9: If a router provides the inbound P_Key enforcement service and
if the PartitionEnforcementInbound component of the PortInfo Attribute is
set to one, then any packet received on a virtual lane other than 15 shall
either be discarded or truncated such that it contains no data past the BTH
if the value in the P_Key field in the BTH is not contained in the receiving
port's P_Key list and either of the following conditions are true:

• LNH field in the LRH contains binary 11 and IPVer field in the GRH
contains 6.

• LNH field in the LRH contains binary 10.

o19-10: If a router provides the inbound P_Key enforcement service and
if the PartitionEnforcementInbound component of the PortInfo Attribute is
set to one, then any packet received on a virtual lane other than 15 shall
either be discarded or truncated in length such that it contains no more
than 64 bytes if all of the following conditions are true:

• LNH field of the LRH contains binary 11.

• IPVer field of the GRH does not contain 6.

Raw packets are not subject to P_Key enforcement and shall not be
discarded nor truncated by this mechanism.

19.2.4.4 Packet Relay


Packet relay refers to the operation of transferring a packet from the virtual
lane on the inbound port to the virtual lane on an outbound port. The output
virtual lane selection is specified later in this section. The relay function is
performed regardless of the state of the destination port. In certain port
states the destination port will discard the packet.

C19-17: A router shall relay each unicast packet from the virtual lane zero
through fourteen (if implemented) in which it was received to the output
port indicated by the routing table entry corresponding to the packet's
DGID field.

C19-18: Packets received on virtual lane 15 shall not be relayed to output
ports.

A packet may be relayed to the same port on which it was received; this
is necessary to support some routing scenarios, such as endnodes using one
out of several routers on the subnet as a default router.

o19-11: Routers that support more than one virtual lane, in addition to
virtual lane 15, shall set the value of the VL field in the local route header by
first considering the GRH Tclass field to derive a SL value for the subnet
attached to the output port and then using the SL to VL mapping scheme
as defined in section 7.6.6 VL Mapping Within a Subnet on page 152.

C19-19: Routers shall always recognize and map a Tclass value of 0 to a
best effort SL.

C19-20: Each packet relayed from an inbound port shall be placed on the
virtual lane of the outbound port specified by the SL to VL mapping. If the
new value of VL does not correspond to a configured VL on the outbound
port, the packet shall be discarded. Also, packets received with a VL not
configured for the port shall be discarded.

C19-21: If the virtual lane on the outbound port does not contain sufficient
space for the packet to be relayed, then the packet shall remain in the
virtual lane on the inbound port until sufficient space is available or until the
router lifetime limit mechanism permits the discard of the packet.

C19-22: If the relay function is unable to relay a packet from an inbound port
to an outbound port due to lack of sufficient space in the outbound VL, the
relay function shall continue to relay packets from other virtual lanes
destined for virtual lanes on outbound ports with sufficient space.

C19-23: Packets shall be transmitted on a given port and SL in the same
order as they were received from a given port.

o19-12: The relay function may, but is not required to, relay packets in the
inbound portion of virtual lanes that are behind packets that are blocked
due to insufficient space in the outbound portion of virtual lanes.

The method of arbitration when multiple inbound VLs have packets
destined for the same outbound VL is left to the implementor, but the
arbitration should service all inbound ports fairly.

C19-24: Routers shall not continuously assert backpressure (i.e. fail to
grant link credits). Regardless of what congestion policy an IBA router
associates to its relay function, routers shall not cause deadlock in the fabric.

C19-25: Packets whose Hop count is less than 2 shall be discarded.

C19-26: The Hop count of every relayed packet is decremented by one.

19.2.4.5 Transmitter Queuing

Relayed packets shall be queued in the outbound portion of virtual lanes.

19.2.4.5.1 Outbound P_Key Enforcement

The implementation of the outbound P_Key enforcement service in
routers is optional. This section specifies the requirements of the service
if implemented.

Outbound P_Key verification shall be enabled and disabled for each port
individually based on the PartitionEnforcementOutbound component of
the PortInfo attribute.


o19-13: If a router provides the outbound P_Key enforcement service and
the PartitionEnforcementOutbound component of the PortInfo Attribute is
set to zero, then the outbound P_Key enforcement service shall be
disabled for packets received on the corresponding port.

o19-14: If a router provides the outbound P_Key enforcement service, it
shall maintain a separate list of P_Keys associated with each port.

o19-15: If a router provides both the inbound P_Key enforcement service
and the outbound P_Key enforcement service, then the list of P_Keys
associated with each port shall be the same list for both the inbound P_Key
enforcement service and the outbound P_Key enforcement service.

o19-16: If a router provides the outbound P_Key enforcement service, the
P_Key table associated with each port shall be capable of containing
between 1 and 65535 P_Keys, inclusive (the exact number is left as an
implementation parameter).

o19-17: If a router provides the outbound P_Key enforcement service, the
P_Key table associated with each port shall be programmable using the
P_KeyTable attribute defined in 14.2.5.7 P_KeyTable on page 642.

o19-18: If a router provides the outbound P_Key enforcement service and
if the PartitionEnforcementOutbound component of the PortInfo Attribute
is set to one, then any packet to be transmitted on a virtual lane other than
15 on that port shall either be discarded or truncated such that it contains
no data past the BTH if the value in the P_Key field in the BTH is not
contained in the transmitting port's P_Key list and either of the following
conditions are true:

• LNH field in the LRH contains binary 11 and IPVer field in the GRH
contains 6.

• LNH field in the LRH contains binary 10.

o19-19: If a router provides the outbound P_Key enforcement service and
if the PartitionEnforcementOutbound component of the PortInfo Attribute
is set to one, then any packet to be transmitted on a virtual lane other than
15 of that port shall either be discarded or truncated in length such that it
contains no more than 64 bytes if all of the following conditions are true:

• LNH field of the LRH contains binary 01 or binary 11.

• IPVer field of the GRH does not contain 6.

Raw packets, i.e. packets in which the LNH field of the LRH contains
binary 00 or 01, are not subject to P_Key enforcement and are not
discarded nor truncated by this mechanism.

19.2.4.6 Packet Transmission


C19-27: If the FilterRawOutbound component of the transmitting port's
PortInfo attribute is set to one, then the router shall discard all packets to
be transmitted on that port in which the LNH field of the LRH contains a
binary 00 or binary 01 (i.e. raw packets).

C19-28: Routers shall perform SL to VL mapping as defined in 7.6.6 VL
Mapping Within a Subnet on page 152. This mapping is based on the
outbound SL to be used for the packet.

C19-29: Packets shall be transmitted with a valid VCRC field computed as
specified in section 7.8.2 Variant CRC (VCRC) - 2 Bytes on page 163,
unless required otherwise in this chapter.

C19-30: Each packet shall be transmitted with an EGP character
appended unless required otherwise in this chapter.

Routers shall support the requirements specified in 7.6.9 VL Arbitration
and Prioritization on page 154.



19.2.5 Error Handling 



This section specifies required operation under error conditions for each 
of the conceptual functions. 



19.2.5.1 Router Ports Errors 

C19-31: Each port on an IBA router shall comply with the physical layer
error requirements specified in Chapter 6: Physical Layer Interface on
page 130.

C19-32: Each port on an IBA router shall comply with the link layer error
requirements specified in Chapter 7: Link Layer on page 134 of this
specification.

19.2.5.2 Receiver Queuing Errors

The receiver queueing function may discard any packet that meets any of
the following conditions:

• There is insufficient space in the virtual lane to receive a packet of the
size indicated in the PktLen field in the local route header.

• The size of the packet indicated by the PktLen field in the local route
header indicates that the packet exceeds the MTU size supported by
the Router port.

The receiver queueing function may discard any packet if its transmission
has not been initiated and if the packet meets any of the following
conditions:

• There is insufficient space in the virtual lane to receive the packet.

• The packet exceeds the MTU size supported by the output port.

• A VCRC error was detected on reception.

• The length of the received packet was different from that indicated by
LRH:PktLen.

• The packet has a framing error.

• The packet was received with an EBP delimiter appended.

• The length of the packet was too short to contain a LRH, GRH, and a
VCRC.

C19-33: Any packet that exceeds the MTU size supported by the output
port and that is not discarded shall be truncated to any size that meets the
MTU size limitation of the port.

19.2.5.3 Packet Relay Errors

C19-34: Packets with no GRH, or with a GRH version not supported by
the Router shall be discarded.

19.2.5.4 Transmitter Queueing Errors

C19-35: Routers shall implement the Packet Lifetime limits and Head of
Queue Lifetime Limit mechanisms defined for IBA Switches in 18.2.5.4
Transmitter Queueing on page 828.

The Packet Lifetime limit is determined from the LifeTimeValue
component of the RouterInfo attribute using the same formula as the Switch
Lifetime Limit. The Head of Queue Lifetime Limit is determined from the
HOQLife component of the PortInfo attribute using the same formula as
its switch counterpart.

The above mentioned limits on packet lifetime inside IBA routers and
switches are meant to help drain packets from the IBA fabric before they
can present a hazard to the IBA transport layer finite sequence number
space. These limits are not defined as congestion management
mechanisms, and should not kick in under normal circumstances, even in
congestion scenarios.

19.2.5.5 Packet Transmission Errors


C19-36: Each packet to be transmitted that was received with an error
indicated by the link layer shall be transmitted with an EBP character
appended and the VCRC field shall contain the one's complement of the
valid VCRC.

o19-20: Each packet to be transmitted that was received with an error
indicated by the link layer may be truncated in length.

C19-37: Each packet to be transmitted that is truncated in length as
permitted or specified by any condition in this chapter shall be corrupted as
specified in 7.3 Packet Receiver States on page 138.

19.2.6 Subnet Management Agent Requirements

C19-38: Each router port shall implement a Subnet Management
Interface (SMI) as specified in Chapter 14: Subnet Management on page 610.

C19-39: Routers shall support a General Services Interface (GSI) as
specified in Chapter 16: General Services on page 717.

The General Services Interface is used, for example, for GID to LID
address resolution.

27 
28 
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Those for Volume 2 are defined in InfiniBand Architecture Specification, 
Volume 2 Chapter "Volume 2 Compliance Summary". 



Each Category has a dedicated section in this chapter that contains, 
among other things, a complete reference list of Volume 1 compliance 



9 

10 
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Chapter 20: Volume 1 Compliance Summary i 

^ 2 

3 
4 
5 

20.1 Compliance Definition g 

This chapter specifies the Compliance Categories that are approved for 7 
labeling various products that contain InfiniBand content. This will allow g 
vendors to label their products and claim InfiniBand compliance without 
creating confusion in the marketplace. This chapter addresses compli- 
ance to the feature set defined by Volume 1 of the InfiniBand Specifica- 
tion. 11 

12 

20.1.1 Product Application 13 

Each product that has InfiniBand content may claim InfiniBand Compli- 14 

ance to one or more of the Categories defined in the Compliance Summa- 1 5 

ries of the InfiniBand Specification. A product shall not simply claim 1 q 
"InfiniBand Compliant". 

Each claim of compliance shall be a list of one or more valid InfiniBand 

Compliance Categories from Volume 1 or Volume 2. It's appropriate for 1^ 

some products to include Compliance Categories from both Volumes 1 20 

and 2. 21 

22 

The valid Volume 1 Compliance Categories are defined below. 23 

24 
25 
26 

20.2 Volume 1 Compliance Categories 27 

28 

Volume 1 Compliance Categories refer to the functionality of each entity 
defined in Volume 1. Table 258 on page 844 lists all valid Volume 1 Com- 29 
pliance Categories along with their full names. 30 

31 

Because optional functionality may be associated with a given Compli- 32 
ance Category, zero or more Compliance Qualifiers may be associated ^3 
with that Category. Table 258 lists all valid Qualifiers under each Category. 
Qualifiers shown in bold italics indicate functionality that is actually non- 
optional for that specific category, but those Qualifiers may still appear in 
some of the Compliance Statements listed under that Category. 36 

37 

Table 259 on page 846 lists Volume 1 's complete set of Qualifiers along 33 
with their full names. Section 20.2.1 discusses Qualifiers in more depth. 39 
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Statements that directly apply to that category. Table 258 provides a refer- 
ence to each section, including a hypertext link with on-line versions of the 
spec. 

Table 258 Volume 1 Compliance Categories 



Category | Full Name | Valid Qualifiers | Reference
HCA-CI | Host Channel Adapter - Channel Interface | VLs, RC, UC, RD, RawD, APM, UDMcast, RawDMcast, RDMA, Atomics, P_Key traps, P_Key counters, Notice, Trap | Section 20.3 on page 848
TCA | Target Channel Adapter | VLs, RC, UC, RD, RawD, APM, UDMcast, RawDMcast, RDMA, Atomics, P_Key traps, P_Key counters, Notice, Trap | Section 20.4 on page 860
SW | Switch | VLs, UDMcast, P_Key SRE, P_Key SRE_In, P_Key SRE_Out, Notice, Trap | Section 20.5 on page 867
RTR | Router | VLs, RawD, UDMcast, RawDMcast, P_Key SRE, Notice, Trap | Section 20.6 on page 871
SM | Subnet Manager | Trap | Section 20.7 on page 874
SA | Subnet Administration | UDMcast, Trap, SAOPT | Section 20.8 on page 875
CM | Communication Manager | APM | Section 20.9 on page 876
PFM | Performance Manager | Trap | Section 20.10 on page 876
VM | Vendor-Defined Manager | Trap | Section 20.11 on page 877
OMA | Optional Management Agent | Trap, AMA, DMA, SNMP, VMA | Section 20.12 on page 877
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20.2.1 Volume 1 Compliance Qualifiers i 

Compliance Qualifiers indicate which compliance statements apply only if
a product supports an optional feature or specified combination of
optional features.

Some compliance statements apply to multiple Compliance Categories,
and thus appear in the Compliance Statement List under each applicable
Category. Some of these "shared" compliance statements include
Qualifiers associated with functionality that is optional in some
Categories and mandatory in others. In each Category where the
functionality is mandatory, the associated Qualifier is shown in bold
italics for that Category's "Valid Qualifiers" entry in Table 258.

C20-1: If a product claims to support a given optional feature, the
product must comply with all compliance statements that apply to that
optional feature.

For example, an HCA-CI that claims to support Reliable Datagram Service
must comply with all statements under the HCA-CI Compliance Statement
List that apply, given the RD Qualifier.

A product shall not include in its list of supported optional features any
features that are in fact mandatory for the Category the product claims
compliance to. Qualifiers for these mandatory features are shown in bold
italics in Table 258. For example, Reliable Connection Service is
mandatory for HCA-CI, so RC must not be included in an HCA-CI's list of
supported optional features even though the product must still meet all RC
requirements.

A product may claim support for multiple optional features, in which case
the product must comply with all compliance statements that apply to the
particular set of optional features claimed by the product, noting that
some compliance statements apply only for specific combinations of
qualifiers.

Table 259 lists and describes the Volume 1 Compliance Qualifiers that a
product can claim compliance to. To abbreviate the optional support, one
or more Qualifiers can be listed after the Category in braces. For example,
a Target Channel Adapter that supports Reliable Datagram Service and
also Automatic Path Migration can be abbreviated with TCA{RD,APM}.
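
The brace abbreviation can be handled mechanically. The following is a
minimal Python sketch, not part of the specification: the names
parse_claim, format_claim and VALID_QUALIFIERS are hypothetical, and the
qualifier sets are transcribed from Table 258 for a few Categories only.

```python
import re

# Valid Qualifiers per Category, transcribed from Table 258 (subset only).
_CHANNEL_ADAPTER = {
    "VLs", "RC", "UC", "RD", "RawD", "APM", "UDMcast", "RawDMcast", "RDMA",
    "Atomics", "P_Key traps", "P_Key counters", "Notice", "Trap",
}
VALID_QUALIFIERS = {
    "HCA-CI": _CHANNEL_ADAPTER,
    "TCA": _CHANNEL_ADAPTER,
    "SW": {"VLs", "UDMcast", "P_Key SRE", "P_Key SRE_In", "P_Key SRE_Out",
           "Notice", "Trap"},
    "CM": {"APM"},
}

# A Category name, optionally followed by a brace-enclosed Qualifier list.
_CLAIM = re.compile(r"^\s*([A-Za-z-]+)\s*(?:\{([^{}]*)\})?\s*$")


def parse_claim(text):
    """Split an abbreviation such as 'TCA{RD,APM}' into (category, qualifiers)."""
    match = _CLAIM.match(text)
    if match is None:
        raise ValueError(f"not a compliance abbreviation: {text!r}")
    category = match.group(1)
    qualifiers = [q.strip() for q in (match.group(2) or "").split(",") if q.strip()]
    if category not in VALID_QUALIFIERS:
        raise ValueError(f"unknown Category: {category}")
    unknown = [q for q in qualifiers if q not in VALID_QUALIFIERS[category]]
    if unknown:
        raise ValueError(f"Qualifiers not valid for {category}: {unknown}")
    return category, qualifiers


def format_claim(category, qualifiers):
    """Build the abbreviation; a claim without Qualifiers is the bare Category."""
    return category + ("{" + ",".join(qualifiers) + "}" if qualifiers else "")
```

The sketch checks only that each Qualifier is valid for the Category. It
does not check the separate rule that mandatory features, such as RC for
an HCA-CI, must not be listed as optional.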

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Table 259 Volume 1 Compliance Qualifiers 



Qualifier | Description
VLs | Port Supporting More than One Data VL
RC | Reliable Connection Service
UC | Unreliable Connection Service
RD | Reliable Datagram Service
RawD | Raw Datagram Service
APM | Automatic Path Migration
UDMcast | Unreliable Datagram Multicast
RawDMcast | Raw Datagram Multicast
RDMA | Remote Direct Memory Access
Atomics | Atomic Operations
P_Key SRE | P_Key Enforcement by Switches or Routers
P_Key SRE_In | Inbound P_Key Enforcement by Switches or Routers
P_Key SRE_Out | Outbound P_Key Enforcement by Switches or Routers
P_Key traps | Trap Generation for P_Key Violations
P_Key counters | Counters for P_Key Violations
Notice | Standard Format & Queue for Data About Events
Trap | Asynchronous Event Notification
SAOPT | Subnet Administration Bulk Update Facilities
AMA | Application-specific Management Agent
DMA | Device Management Agent
SNMP | SNMP Tunneling Agent
VMA | Vendor-specific Management Agent



20.2.1.2 Compliance Statements with Multiple Qualifiers 

Some compliance statements contain combinations of Qualifiers, and 
apply only if the specified combination is true. For example, a compliance 
statement beginning with "RD and Atomics:" applies only if both RD and 
Atomics are supported. If a compliance statement begins with "RD or 
Atomics:", the statement shall apply if either RD or Atomics is supported. 
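
The "and"/"or" rule can be written as a small predicate. This is a
sketch under stated assumptions rather than a definitive implementation:
the function name statement_applies is hypothetical, and it assumes a
qualifier prefix uses at most one kind of connective ("and" or "or"),
which is all that the examples above show.

```python
def statement_applies(qualifier_expr, supported):
    """Return True if a statement with this qualifier prefix applies.

    qualifier_expr is the text before the colon, e.g. "RD and Atomics".
    supported is the set of Qualifiers the product supports.
    """
    expr = qualifier_expr.strip()
    if not expr:
        return True  # no Qualifier: the statement applies in all cases
    if " and " in expr:
        # Every listed Qualifier must be supported.
        return all(q.strip() in supported for q in expr.split(" and "))
    if " or " in expr:
        # Any one of the listed Qualifiers is enough.
        return any(q.strip() in supported for q in expr.split(" or "))
    return expr in supported
```

For a product that supports only RD, "RD and Atomics" does not apply,
while "RD or Atomics" does.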
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20.2.2 Compliance Statement Lists i 

Within each Compliance Category section is a list of the compliance
statements that apply to that particular category. Here is a sample list
entry:

• o9-16: RD: PSN Insertion for Reliable Svc Pkts Page 231
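
An entry of this shape has four parts: a label (a "C" or "o", the chapter
number, and a statement number), an optional Qualifier prefix, a title,
and a page number. The following Python sketch splits an entry into those
parts. The names Entry and parse_entry are hypothetical, and the sketch
assumes that Qualifiers appear only on "o" entries and are separated from
the title by the first colon.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    kind: str        # "C" applies in all cases, "o" depends on optional features
    chapter: int     # chapter containing the formal compliance statement
    number: int      # statement number within that chapter
    qualifiers: str  # e.g. "RD" or "APM and RD"; empty if none
    title: str
    page: int


_ENTRY = re.compile(
    r"^(?:•\s*)?([Co])(\d+)-(\d+):\s+"             # optional bullet, then label
    r"(.*?)"                                        # qualifiers and title
    r"(?:\s*(?:\.\s*){2,}|\s+)Page\s+(\d+)\s*$"     # dot leaders or space, page
)


def parse_entry(line):
    """Parse one Compliance Statement List line into an Entry."""
    match = _ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a list entry: {line!r}")
    kind, chapter, number, rest, page = match.groups()
    qualifiers, title = "", rest.strip()
    if kind == "o" and ": " in title:
        # Text before the first colon is the Qualifier prefix.
        qualifiers, title = (s.strip() for s in title.split(": ", 1))
    return Entry(kind, int(chapter), int(number), qualifiers, title, int(page))
```

Applied to the sample entry, the sketch yields kind "o", chapter 9,
number 16, qualifier "RD", and page 231.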



20.2.2.1 Hypertext Links

Online versions of this specification have hypertext links present before
each of the lines in the Compliance Statement lists. These links are
indicated by the "•" at the beginning of the line and will lead to the
actual statement in the body of the specification that contains the
details for each of the compliance entries.

Each Compliance Statement List entry also contains the page number for
use with hard-copy versions of the specification.

20.2.2.2 Compliance Statement Labels

All formal compliance statements throughout the specification are labeled
so they can be uniquely identified. Each label begins with either a "C" or
an "o", indicating whether the compliance statement applies in all cases
with respect to its category or whether the compliance statement is
qualified with respect to optional features. The "o" is uncapitalized to
make it more easily distinguishable from the "C" in Compliance Statement
Lists.

The next portion of the label is the number of the chapter in which the
formal compliance statement appears. The final portion of the label is a
compliance statement number, which starts with "1" in each chapter. "C"
and "o" compliance statements are numbered independently.

20.2.2.3 Compliance Statement Titles

Each line within a Compliance Statement List contains a brief title for the
respective compliance statement. Because of the limited space and lack
of context, each title is only intended to convey the topic of the
compliance statement, and not necessarily convey its actual requirements.

Compliance statements that apply only to optional functionality are
indicated by the presence of one or more Qualifiers at the beginning of
the title, followed by a colon. For example, the above sample Compliance
Statement Title contains the "RD" qualifier.

20.2.3 Common Requirements

Some Compliance Categories share common requirements, such as
those that apply to all ports. To avoid unnecessary duplication, certain
common requirement sets have been collected and referenced by the
appropriate Compliance Categories instead of replicating those lists of
requirements under each separate Category.
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20.3 HCA-CI Compliance Category 



In order to claim compliance to the InfiniBand Volume 1 specification to
the Compliance Category of HCA-CI, a product shall meet all requirements
specified in this section, except for those statements preceded by
Qualifiers that the product does not support. In addition, a compliant
HCA-CI shall meet all Section 20.13 Common Port Requirements on page 878
and all Section 20.14 Common MAD Requirements on page 879.

Some compliance statements in the HCA-CI Category contain require- 
ments that apply to both mandatory and optional features. For instance, 
some compliance statements mention both QPs and EE Contexts, though 
EE Contexts are relevant only if RD Service is supported. In such cases, 
the requirements on an optional feature apply only if the product claims to 
support the optional feature. 
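
The selection rule for this Category can be sketched as a filter over the
statement list. This is an illustration only: the function name
required_statements and the tuple layout are hypothetical, the four
entries are copied from the list that follows, and Qualifiers that are
mandatory for the Category (such as RC for HCA-CI) are passed in
separately so that they always count as supported.

```python
def required_statements(entries, claimed_optional, mandatory=frozenset()):
    """Return the labels of the statements a product must meet.

    entries: iterable of (label, qualifier_expr, title) tuples.
    claimed_optional: Qualifiers the product claims as optional features.
    mandatory: Qualifiers that are non-optional for the Category.
    """
    supported = set(claimed_optional) | set(mandatory)

    def applies(expr):
        if not expr:
            return True  # unqualified statements always apply
        if " and " in expr:
            return all(q.strip() in supported for q in expr.split(" and "))
        if " or " in expr:
            return any(q.strip() in supported for q in expr.split(" or "))
        return expr in supported

    return [label for label, expr, _title in entries if applies(expr)]


# A few entries taken from the HCA-CI Compliance Statement List.
HCA_CI_SAMPLE = [
    ("C9-25", "", "Transmission of Requests - Ordering Rule"),
    ("o9-19", "RD", "Acknowledge Packets - Strong Ordering"),
    ("o10-29", "APM", "handling invalid incoming mig requests"),
    ("o10-23", "APM and RD", "req'd path mig States & Transitions"),
]
```

A product claiming HCA-CI{RD} would pass {"RD"} as claimed_optional and
{"RC"} as mandatory, and would be held to C9-25 and o9-19 from this
sample but not to the APM statements.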



EUI-64 Assignment - At Least One per Port Page 110 

GID Usage and Properties Page 110 

Addressing Rules Page 114 

LID (Local Identifier) Usage and Properties Page 114 

RD: Reliable Datagram ETH Format Page 123 

RDMA: RDMA ETH Format Page 124 

Atomics: Atomic Extended Transport Hdr Format Page 125 

Atomics: Atomic ACK ETH Format Page 126 

RawD: Raw Packet Header Rules Page 128 

RawD: EtherType Usage in RWH Page 128 

RawD: Raw Packet Length Rule Page 128 

RawD: Raw Packet Header Format Page 128 

Packet Discard Required if Link Checks Fail Page 139 

VL15 Buffer(s) required For each Port Page 148 

SL-to-VL Mapping Table Size Page 152 

RawDMcast: Raw Multicast Operational Rules Page 183 

Link Layer DLID Check - Use Base LID Only Page 185 

Rules for Including a GRH in Packets Page 192 

Optional Use of GRH in Packets Page 192 

Transport - Opcode, Header, and Payload Table Page 200 

Solicited Event Bit Invokes CQ Event Handler Page 202

Solicited Event Bit - Excluded from Hdr Validation Page 202

BTH TVer Field Value Page 202 

BTH - Reserve 8 Field Value Page 203 

BTH - Reserve 7 Field Value Page 203 

RD: RDETH - Reserve Field Value Page 204 

DETH - Reserve Field Value Page 204 

RETH - DMA Length Field Value Limits Page 206 

SEND Operation Size Limits Page 209 

Segmentation and Reassembly of RC and UC Page 209 

RD: Segmentation and Reassembly Page 209 
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C9-23: RDMA READ Request - Req'd Headers Page 218 

C9-24: RDMA READ Response - Req'd Headers Page 218 

o9-15: Atomics: ATOMIC Op Request - Req'd Headers Page 221

o9-16: Atomics: ATOMIC Op Response - Req'd Headers Page 221

o9-17: Atomics: ATOMIC Op - QP Atomicity Rule Page 221

o9-18: Atomics: ATOMIC Op - Enhanced Atomicity Rule Page 221

C9-25: Transmission of Requests - Ordering Rule Page 227

C9-26: Transmission of Message - Data Payload Order Page 227

C9-27: Acknowledge Packets - Strong Ordering Page 227

o9-19: RD: Acknowledge Packets - Strong Ordering Page 227

C9-28: Responder - Order of Request Execution Page 227

C9-29: Receipt of Requests - Order of Completion Page 227

C9-30: Requester - Order of WQE Completion Page 228

C9-31: Requester - WQE Fence Attribute Behavior Page 228

C9-32: WQE Order of Completion vs. Execution Page 228

o9-20: RDMA: Responder RDMA WRITE Buffer Rule Page 228

o9-21: RDMA: Responder RDMA READ Buffer Rule Page 228

C9-33: Receive Queue - Buffer Content Validity Page 228

C9-34: Transport Layer - Packet Header Validation Page 228

C9-35: Transport Layer - IBA Packts - Header Validation Page 228

C9-37: BTH Validation - Dest QP and QP State Page 231
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09-39: BTH Validation - Silent Drop Rule Page 232 

C9-40: BTH Validation - Behavior - RD Packet Page 232

C9-41: Transport Layer - BTH P_Key - QP0 Rule Page 232

C9-42: Transport Layer - BTH P_Key - QP1 Rule Page 232

C9-43: Transport Layer - Required P_Key Validation Page 233

o9-24: RD: Transport Layer - Required P_Key Validation Page 233

C9-44: GRH - NxtHdr Field - Validation Page 233

C9-45: GRH - IPVers Field - Validation Page 233

C9-46: GRH - Dest QP UD, SGID/DGID non-Validation Page 234

C9-47: GRH - SGID/DGID Field - Validation Page 234

o9-25: RD: RDETH - EE Context Field Value - Validation Page 235

o9-26: RD: RDETH - EE Context - Validation Behavior Page 235
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09-49: DETH -Q_Key Field Value -QP1 Rule Page 236 

C9-50: DETH - Q_Key Field Value - Validation Page 236

C9-51: Transport Layer - ACK depends on Valid Keys Page 236

C9-52: Transport - LRH - SLID/DLID Field Validation Page 237
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09-59: Packet Validation - IBA Unreliable Multicast - QP Page 238 
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09-32: RD: BTH - Initial PSN for RD Service Page 248 

C9-66: Requester - PSN Value - RC Service Page 249

o9-33: RD: Requester - PSN Value - RD Service Page 249

o9-34: RD: Validation of EEC RDD Against Send Queue Page 250
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RD: EEC vs QP - RDD Mismatch Behavior Page 250 

Requester - BTH OpCode Field Value Rules Page 250 

Requester - BTH OpCode Field Value Table Page 250 

Requester - Packet PayLen - First/Middle Page 250 

Requester - Packet PayLen - Only Page 251 

Requester - Packet PayLen - Last Page 251 

Requester - RETH DMALen Field - Limits Page 251 

Responder - Validation of Inbound RC Requests Page 251 

RD: Responder - Validation of Inbound RD Req Page 251 
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Responder - New Request - RC Exec/Response Page 254 

RD: Responder - New Req. - Exec/Response Page 254 

Responder - Valid Duplicate Req Behavior Page 254 

RD: Responder - Valid Duplicate Req Behavior Page 254 
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RD: Resp. - Behavior after NAK Sequence Error Page 256 

Responder - Validation of OpCode Seq Page 257 

RD: Responder - Validation of OpCode Seq Page 257 

RD: Responder - New Request Rule Page 257 
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Resp. - Request of Unsupported Fen - Behavior Page 258 

RD: Resp. - Request of Unsupported Fcn Page 258

Resp. - Reserved OpCode Error - Behavior Page 258 
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R_Key Violation Behavior Page 260 

R_Key Violation Behavior - Completion Rule Page 260 

LRH - PktLen Validation - WQE buffer Page 260 
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Recreate READ Response Req. - Order Page 271 

Atomics: Duplicate ATOMIC Op Req. Behavior Page 272 

Atomics: Duplicate ATOMIC Op Req. Error Page 273 

Atomics: Duplicate ATOMIC Req. - Local Error Page 273 

NAK PSN Field Value - Except for RDMA READ Page 273 

NAK PSN Field Value - RDMA READ Page 273 

RNR NAK - PSN Field Value Page 273 

Wait for first valid ePSN after Sequence Error Page 273

Response to Duplicate Requests - except NAK Page 273 

BTH AckReq Field - Behavior Page 274 

PSN Field Value - RDMA READ Response Page 275 

AETH requirement Page 275 

AETH Syndrome - Defined Values Page 277 

RD: AETH Syndrome - Defined Values Page 277 

RD: AETH Syndrome Value - MSN Invalid Page 277 

Request - Malformed ACK Message Rule Page 278

Responder - PSN Field Value - Sequence Error Page 279 

NAK Sequence Error - Subsequent Behavior Page 279 

PSN Field Value - Duplicate Request - Behavior Page 279 

BTH Field Value - NAK Remote Access Error Page 279
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RD: AETH Field Value - RNR NAK Timer Page 282 

RNR NAK Retry - Counting and Behavior Page 282 
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Packet Header Validation - Transport Page 284 
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Timeout Rule - Based on Timeout Interval Page 289 
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End-to-End Flow Control Credit - Dupl. ACKs Page 290 

Duplicate ACKs - Behavior Page 290 

Reliable Connection and Reliable Service Page 291 
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Responder - MSN Calculation Page 293 

AETH MSN Field Value Page 295 

Receive Queue - End-to-End Flow Control Credit .... Page 296 

End-to-End Flow Control - Send Queue Behavior .... Page 297 
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End-to-End Flow Control Rules Page 298 

End-to-End Credit - Usage Page 298 

End-to-End Flow Control - Lack of Initial Credit Page 298 

End-to-End Flow Control Credit - Calc/Update Page 301 
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End-to-End Flow Control Credit - AETH Encoding Page 301 

Requester - Send Queue Behavior - Credit Limit Page 301 

Requester Behavior - Transaction Ordering Rules .... Page 302 

End-to-End Flow Control - Encoded Count Page 302 

Requester Behavior - Send Queue - WQE Limit Page 304 

SEND Request - Limited WQE Case - 1 Pkt Page 304 

RDMA WRITE - Request Xmt - AckReq bit Page 305 

Requester - Ability to Receive Unsolicited ACK Page 305 

RD: QP Availability, Capabilities Page 307 
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RD: Error Detection and Handling Page 308 

RD: Communication Management Support Page 308 
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RD: SEND/READ/WRITE Support Page 308 

RD: EEC Management Support Page 308 

RD: RDD Domain Support Page 309 
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Transport - Packet Header Validation Page 316 
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UC Service - OpCode Examination Page 317 
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RETH R_Key Validation - Behavior Page 317 

Inbound Request Packet - Validation - UC & UD Page 317 
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Invalid OpCode Behavior - UC Page 325 
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PSN Generation and Message Completion - UD Page 332 

PSN Calculation - Unreliable Datagram Page 332 
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Responder - PSN Treatment - UD Page 332 

BTH Opcode Field Value - Validation - UD Page 333 

Inbound SEND Request - Queue Entry - UD Page 333 

Packet Headers - Raw vs. IPv6 NxtHdr Page 334 
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RD: Requester, Transmit - Locally Detected En-or .... Page 338 

Requester - Excessive Retry Detection - RC Page 338 

RD: Requester - Excessive Retry Detection Page 338 
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Base GUID reqs for first entry in GID Table Page 371 
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Region/Window/Addr_Handle & PD associations Page 373 

PD check between QP & Region/Window Page 373 

PD check between UD QP & Addr_Handle Page 373 
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O10-6: RD: RD QP and RDD associations Page 379 

o10-7: RD: EEC and RDD associations Page 379

o10-8: RD: RDD check between RD QPs & EECs Page 379

o10-9: RD: minimum number of RDDs Page 380

o10-10: RD: deallocating RDD while RDD still associated Page 380

C10-19: Req'd States & Transitions for QPs Page 386

o10-11: RD: Req'd States & Transitions for EECs Page 386

C10-20: Initial State for newly created QP/EE Page 386

C10-21: Default attributes for QP/EE and when to set Page 386

o10-12: RD: SQ WRs referencing EEC in Reset State Page 386

o10-13: RD: incoming msgs targeting EEC in Reset State Page 386

C10-22: WR submitted to QP in Reset State Page 387

C10-23: Incoming messages targeting QP in Reset State Page 387

o10-14: RD: incoming msgs targeting EEC in Init State Page 387

o10-15: RD: SQ WRs referencing EEC in Init State Page 387

C10-24: RQ WRs submitted to QP in Init State Page 387

C10-25: SQ WRs submitted to QP in Init State Page 387

C10-26: Incoming msgs targeting QP in Init State Page 388

C10-27: RQ WRs posted to QP in RTR State Page 388

C10-28: Incoming msgs targeting QP in RTR State Page 388

o10-16: RD: incoming msgs targeting EEC in RTR State Page 388

o10-17: RD: SQ WRs referencing EEC in RTR State Page 388

C10-29: SQ WR posted to QP in RTR State Page 388

C10-30: WRs posted to QP in RTS State Page 389

C10-31: WRs processed by QP in RTS State Page 389

C10-32: Incoming msgs targeting QP in RTS State Page 389

o10-18: RD: incoming/outgoing msgs w/ EEC in RTS Page 389

C10-33: WR posting to QP in SQD State Page 389

C10-34: Incoming msgs targeting QP in SQD State Page 389

o10-19: RD: incoming msgs targeting EEC in SQD State Page 389

C10-35: Reqs for QP/EE transitioning to SQD State Page 389

C10-36: AAEvent generation after transition to SQD State Page 389

o10-20: RD: SQ WRs referencing EEC in SQD State Page 390

C10-37: SQ WRs posted to QP in SQD State Page 390

C10-38: RQ WRs on / posted to QP in SQEr State Page 390

C10-39: WC for SQ WR that caused transition to SQEr Page 390

C10-40: SQ WRs subsequent to one causing SQEr Page 391

C10-41: WC for WR that caused transition to Error State Page 391

C10-42: WRs subsequent to one causing Error State Page 392

o10-21: RD: SQ WR referencing EEC in Error State Page 392

o10-22: APM: req'd path migration States & Transitions Page 393

o10-23: APM and RD: req'd path mig States & Transitions Page 394

o10-24: APM: reqs for Migrated to Rearm transition Page 394

o10-25: APM: reqs for Armed to Migrated transition Page 394

o10-26: APM: initial path migration State Page 395

o10-27: APM: behavior for Armed to Migrated transition Page 395

o10-28: APM: handling valid incoming migration requests Page 395

o10-29: APM: handling invalid incoming mig requests Page 396

o10-30: APM: reqs for CI causing transition to Migrated Page 396

o10-31: APM: reqs for transition from Migrated to Rearm Page 396

o10-32: APM: req'd behavior following transition to Rearm Page 396

o10-33: UDMcast: minimum number of mcast groups Page 397

o10-34: UDMcast: reqs for QP to receive mcast msgs Page 397

o10-35: UDMcast: reqs if incoming Dest QPN not valid Page 399

C10-43: Preparing/specifying mcast group dest address Page 399

o10-36: UDMcast: reqs for UD Multicast loopback Page 399

C10-44: Accesses to unregistered memory locations Page 400

C10-45: Registrations succeed or fail in atomic fashion Page 400

C10-46: Req'd memory access rights Page 401

o10-37: Atomics: support for Remote Atomic access right Page 401

C10-47: Local Read memory access right is automatic Page 401

C10-48: Rem Write or Rem Atomic require Local Write Page 401
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C10-49: Registration of overlapping memory areas Page 403 

C10-50: Registering arbitrarily aligned buffers Page 404 

C10-51 : Registering arbitrary length buffers Page 404 

C10-52: Reqs for pinning Memory Region pages Page 405 

C10-53: Physical buffer alignment & length reqs Page 406 

C10-54: General reqs wrt local accesses to Regions Page 406 

C10-55: Local access rights checking against Region Page 406 

C10-56: General reqs wrt remote accesses to Regions Page 407 

C10-57: Remote access rights checking against Region Page 407 

C10-58: Deregistration of overlapping Memory Regions Page 407 

C10-59: Deregistration while access in progress Page 407 

C10-60: Accesses after deregistration completes Page 407

C10-61: Granularity of remote access control to Windows Page 410 

C10-62: Access rights checks for Bind Window Page 410 

C10-63: Previous R_Key after a Bind Window completes Page 411 

C10-64: Execution of SQ WRs subsequent to a Bind Page 411 

C10-65: Zero-length Bind semantics Page 412 

C10-66: Reqs for Windows bound to same Region Page 412 

C10-67: Window changes while access in progress Page 412 

C10-68: Previous binding after a Bind completes Page 412 

C10-69: Window deallocation while access in progress Page 412 

C10-70: Accesses after Window deallocation completes Page 412 

C10-71: Reqs if CI allows orphaned Windows Page 413 

C10-72: Bind-time PD check between QP & Window Page 413 

C10-73: Bind-time check that Region allows Binds Page 413

C10-74: Bind-time check of Region write permissions Page 413 

C10-75: Bind-time check of Region addr bounds & PD Page 413 

C10-76: Window access-time PD check of QP & Window Page 413 

C10-77: Window access-time checks of bounds & rights Page 413 

C10-78: Window access-time checks for each page Page 413 

C10-79: Special access-time checks if orphaned Window Page 414 

C10-80: Service Types supporting Send and Receive Page 415 

C10-81: RQ WR consumption with incoming Send msg Page 415 

C10-82: Service Types req'd to support SAR Page 415 

o10-38: RD: SAR support Page 415

C10-83: RDMA Read support on RC Page 415

C10-84: RDMA Write support on RC and UC Page 415

o10-39: RD: RDMA Read & RDMA Write support Page 415

C10-85: RQ WR with incoming RDMA Read Page 415

C10-86: RQ WR with incoming RDMA Write Page 415

C10-87: RQ WR with incoming RDMA Write; more cases Page 415

C10-88: Target QP check with incoming RDMA Page 416

o10-40: Atomics: required Atomic operations Page 416

o10-41: Atomics: Endian byte ordering accommodation Page 416

o10-42: Atomics: Fetch&Add requirements Page 416

o10-43: Atomics: if remote addr not 64-bit aligned Page 416

o10-44: Atomics: support on RC Page 416

o10-45: Atomics and RD: Atomic support on RD Page 416

o10-46: Atomics: Service Types not supported Page 417

o10-47: Atomics: op result returned in Data Segment Page 417

o10-48: Atomics: atomicity of requests through same HCA Page 417

o10-49: Atomics: atomicity of requests across system Page 417

C10-89: Service Types supporting Bind Memory Window Page 418

o10-50: RD: Bind Memory Window support Page 418

C10-90: Signaled and unsignaled completion support Page 418

C10-91: Reqs for generating CQE when a WR completes Page 418

C10-92: Conditions for not generating CQE for SQ WR Page 419

C10-93: If max msg payload size exceeded for RC or UC Page 419

o10-51: RD: If max msg payload size exceeded for RD Page 419

C10-94: Scatter list support for Receives & RDMA Reads Page 419

C10-95: Gather list support for Sends & RDMA Writes Page 419

C10-96: Rules for SQ WR processing Page 420
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C10-97: Rules for RQ WR processing Page 420 

C10-98: WRs to single queue initiated in order Page 421 

C10-99: RQ WR completion order rule; RD exception Page 421 

o10-52: RD: SQ WRs complete in order Page 422

o10-53: RD: rule for RQ WRs completing in order Page 422

C10-100: Fence Indicator effect on SQ WRs Page 422

C10-101: Table of ordering rules for WRs on same SQ Page 423

C10-102: WQ WCs placed on associated CQ Page 424

C10-103: Given WC not retrieved more than once Page 425

C10-104: WR with signaled completion generates WC Page 425

C10-105: SQ WR completing in error generates WC Page 425

C10-106: RQ WR completion generates WC Page 425

C10-107: WR buffer access once associated WC retrieved Page 425

o10-54: RD: WR freed resource count returned with WC Page 426

C10-108: Buffer access rule for Unsignaled WRs Page 426

C10-109: Single CQ Event Handler per HCA Page 426

C10-110: CQ Event Handler replacement Page 426

C10-111: Outstanding Completion Event notify requests Page 427

C10-112: Rule for Completion Event generation Page 427

C10-113: Rule for when not to generate Completion Event Page 427

C10-114: Completion Event indicates responsible CQ Page 427

C10-116: Invalid P_Key definition & use in table entry Page 428

C10-119: Received packet discarded if P_Key mismatch Page 429

C10-120: Each port contains P_Key table Page 429

C10-121: P_Key Table size reqs Page 429

C10-122: Mechanisms to change P_Key Table contents Page 429

C10-123: P_Key Table initialization wrt non-volatile storage Page 430

C10-124: P_Key checking for incoming packets Page 430

o10-55: P_Key traps: general requirements Page 430

o10-56: P_Key traps: new violation before trap sent Page 431

o10-57: P_Key counters: general requirements Page 431

C10-125: P_Key and P_Key Table associations with QPs Page 431

C10-126: P_Key for packets from a QP's SQ; exceptions Page 431

C10-127: Incoming packet P_Key checking against QP Page 431

o10-58: RD: P_Key & P_Key Table association with EEC Page 432

o10-59: RD: P_Key attachment & checking wrt EEC Page 432

C10-128: Response to SMP requesting P_Key change Page 432

C10-129: Timing req for using updated P_Key Table values Page 432

C10-131: No P_Key checking on packets sent to SMI Page 433

C10-132: Special P_Key checking for packets sent to GSI Page 433

C10-133: P_Key for packets sent from GSI Page 433

C10-135: Immediate error return of control timing Page 434

C10-136: WR with immediate error not posted to WQ Page 434

C10-137: WC for WR completed in error Page 434

C10-138: Async errors before/after event handler regist Page 435

C10-139: Async error handler registration & replacement Page 435

C10-140: Unaffiliated Async Error handling requirements Page 435

C10-141: RC immediate error effect on QP processing Page 436

C10-142: RC SQ completion error effect on SQ & WR Page 436

C10-143: RC RQ completion error effect on QP & WRs Page 436

C10-144: RC AAError effect on QP & WRs Page 436

C10-145: Tables - Compl error handling for RC SQs/RQs Page 436

o10-60: RD: immediate error effect on QP/EE processing Page 437

o10-61: RD: SQ completion error effect on SQ & WR Page 437

o10-62: RD: RQ compl error effect on curr/subseq WRs Page 438

o10-63: RD: completion error effect on EEC State Page 438

o10-64: RD: AAError effect on QP & WRs Page 438

o10-65: RD: AAError effect on QP and EEC Page 438

o10-66: RD: Tables - compl error handling for RD SQ/RQ Page 438

C10-146: UC immediate error effect on QP processing Page 441

C10-147: UC completion error effect on SQ & WR Page 441

C10-148: UC RQ completion error effect on QP & WRs Page 441
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C10-149: Tables - compi error handling for UC SQ/RQ Page 441 

C10-150: UC AAError effect on QP & WRs Page 442 

C10-151: UD immediate error effect on QP & WR Page 442 

C10-152: UD SQ completion error effect on SQ & WR Page 442 

C10-153: UD RQ completion error effect on QP & WRs Page 443 

C10-154: Tables - completion error handling on UD SQ/RQ Page 443 

C10-155: UD AAError effect on QP & WRs Page 443 

C10-156: RawD immediate en^or effect on QP & WR Page 443 

C10-157: RawD SQ completion enror effect on SQ & WR Page 444 

C 10-1 58: RawD RQ completion error effect on QP & WRs Page 444 

C10-159: Tables - compI error handling for RawD SQ/RQ Page 444 

C10-160: RawD AAEn-or effect on QP & WRs Page 445 

C11-1: Table indicating mandatory verbs Page 446 

C11-2: Table indicating verbs req'd for optional features Page 446 

C11-3: Verb functionality not indicated as being optional Page 446 

C11-4: Verb functionality associated w/ optional features Page 446

C11-5: HCA handles for different HCAs are unique Page 449

C11-6: If Open HCA called for already opened HCA Page 449

C11-7: Create QP required initial attributes Page 459

C11-8: Modify QP's general behavior Page 461

C11-9: Modify QP's behavior if invalid request Page 461

C11-10: Table - QP State Transition Properties Page 461

C11-11: Destroy QP deallocates associated resources Page 469

C11-12: Destroy QP's effect on WRs & incoming ops Page 469

C11-13: Get Special QP supports QP0 & QP1 Page 470

o11-1: RawD: Get Special QP supports RawD QPs Page 470

C11-14: Get Special QP reqs wrt SMI & GSI QP handles Page 470

o11-2: RawD: Query HCA returns # suppt'd RawD QPs Page 470

C11-15: Special rule for CQs associated with SMI/GSI Page 470 

C11-16: Resize CQ requirements Page 473 

C11-17: Destroy CQ behavior if any WQs still associated Page 474

o11-3: RD: Modify EEC Attributes general behavior Page 475

o11-4: RD: Table - EEC State Transition Properties Page 476

o11-5: RD: Destroy EEC's effect on WRs Page 481

C11-18: Rereg MR behavior if "Operation denied" Page 487

C11-19: Rereg MR behavior if invalid handle Page 487

C11-20: Rereg MR behavior if other errors Page 487

C11-21: Rereg MR behavior if access in progress Page 487 

C11-22: Rereg Phys MR inherits Rereg MR reqs Page 489 

C11-23: Post Send requirement for return of control timing . . . Page 497 

C11-24: WR access or modification after posting Page 497 

C11-25: Post Send Table - req'd ops for service types Page 497 

C11-26: Post Send table - req'd input modifiers for ops Page 498

C11-27: Post Receive req for return of control timing Page 501 

C11-28: Poll for Comp table - completion err types for SQs. . . . Page 503 

C11-29: Poll for Comp table - completion err types for RQs. . . . Page 504 

C11-30: Rqst Comp Notif req'd Completion Event Types Page 506 

C11-31: Rqst Comp Notif changing to "next" completion Page 507

C11-32: Req Comp Notif not changing from "next" comp Page 507

C11-33: Set Async EH req'd use of new event handler Page 508

C11-34: Affiliated Async Error effect on QP/EE State Page 513

C11-35: Affiliated Async Event effect on QP/EE State Page 513

C11-36: Communication Established AAEvent generation Page 514

C11-37: AAError - CQ Error generation condition & timing Page 514

C11-38: AAError - CQ Error generation timing for overrun Page 514

C11-39: AAError - Catastrophic error condition & timing Page 514

C11-40: AAError - Catastrophic error condition & timing Page 514

o11-6: APM: AAError - APM error condition Page 514

C11-41: UAError - Catastrophic Error conditions Page 515

C11-42: UAError - Port Error condition Page 515

C12-1: CM protocol support req'd with RC, UC, and RD Page 520 

012-12: Conditions when SIDR_REQ msg support req'd Page 521






InfiniBand™ Trade Association



Page 857

Exhibit A, Amendment Under Rule 116 filed Dec. 21, 2007, 09/905,067



InfiniBand™ Architecture Release 1.0
Volume 1 - General Specifications 



Volume 1 Compliance Summary 



October 24, 2000 
FINAL 



C13-28: ClassPortInfo Required for each Mgt Class Page 589

C13-29: ClassPortInfo Required For Each GS Class Page 589

C13-30: SA ClassPortInfo Required Page 589

o13-1: Notice: Notice Data Layout Page 592

o13-2: Notice: InformInfo Data Layout Page 593

C13-32: No Traps Without TrapDLID Target Page 594

o13-3: Trap: Maximum Rate of Generation Page 595

o13-4: Trap: Use of Notice Attribute Page 595

o13-5: Trap: Transaction ID setting Page 595

o13-6: Trap: Response to TrapRepress Page 595

o13-7: TrapRepress Dropped if No Matching Trap Page 595

o13-8: Notice: Notice Queue is FIFO Page 596

o13-9: Notice: Action When Too Many Notices Requested Page 596

o13-10: Notice: Meaning of NoticeCount Page 596

o13-11: Notice: Response to Set(Notice) Page 596

C13-33: SM MADs (SMPs) appear on Port 0 Page 601

C13-36: SMPs Not Dispatched to SMA Appear on QP0 Page 601

C13-37: SMP Processing Above/Below the Verb Layer Page 601

C14-14: Subnet Management Agent Required Methods Page 623

C14-15: M_Key not Checked When PortInfo:M_Key = 0 Page 624

C14-16: M_Key checks when PortInfo:M_Key is not zero Page 624

C14-17: Lease Period Timer Countdown Page 624

C14-18: PortInfo:M_KeyViolations Counting Page 625

C14-19: Lease Period Counting Halts on valid M_Key Page 625

C14-20: M_KeyProtectBits When Lease Period Expires Page 625

C14-21: M_KeyLeasePeriod 0 = Lease Never Expires Page 625

C14-22: M_Key, ProtectBits, & LeasePeriod Set Together Page 626

C14-23: Init of M_Key, ProtectBits & LeasePeriod Page 626

C14-24: SMA Required Attributes Page 626

C14-25: PortInfo Set when M_Key is 0 Page 651

C14-26: PortInfo Set when M_Key is not 0 Page 651

C14-27: Requests to Change RO Registers Are Ignored Page 651

C14-28: SubnGetResp Generation when M_Key is 0 Page 651

C14-29: SubnGetResp Generation when M_Key non0 Page 652

C14-30: SubnGetResp Content Page 652

C14-31: SubnGetResponse TransactionID Page 652

C14-32: SubnGetResp MADHeader:M_Key Page 652

o14-1: Trap: SubnTrap M_Key field Page 652

o14-2: Trap: Trap Generation Interval Page 652

o14-3: Trap: Only Sent When Portstate is Active Page 652

o14-4: Trap, Notice: Monitoring PortState Page 653

o14-5: Trap: Trap 128 on Port State Change Page 653

o14-6: Notice: Logged on Port State Change Page 653

C14-33: P_Key and Q_Key Mismatches Monitored Page 653

C14-34: P_Key or Q_Key Violation Count Reporting Page 653

o14-7: Trap: P_Key, Q_Key Violation =Trap 257, 258 Page 653

o14-8: Notice: Must Log P_Key & Q_Key Violations Page 653

o14-9: Trap: trap 256 On M_Key Mismatch Page 654

o14-10: Notice: M_Key mismatch is logged Page 654

o14-11: Trap: trap 129, 130, or 131 When Link Problems Page 654

o14-12: Notice: Must Log Link Problems Page 654

C16-1: PM Agent is mandatory on all nodes Page 717

C16-2: PM MADs Follow Common MAD Format & Usage Page 718

C16-3: PortSamplesControl, PortSamplesResult Req'd Page 721

C16-4: Each sampler must have >=1 & <=15 counters Page 721

C16-5: PMA Mandatory quantities: all ports, all nodes Page 726

o16-1: Optional Performance Counters: Attributes Page 726

C16-6: PortCounters Attribute is Mandatory Page 732

C16-7: Counters Init at 0 and Saturate at All 1s Page 732

o16-2: Optional Performance Counters: Use Page 736

C16-9: BMA Mandatory on all nodes Page 749

C16-10: BM datagrams follow common MAD format and use Page 751
o16-3: Trap: BM Trap DataDetails Layout Page 756

C16-11: BMA checks B_Key Page 759

C16-12: BMA Action when B_Key check Fails Page 760

C16-13: B_Key, B_Key Protection, B_Key lease at reset Page 760

C17-1: Verbs Layer (Channel Interface) is Mandatory Page 790

C17-2: Multiport CAs Shall Support Multiple Subnets Page 792

C17-3: Association of QPs with Ports Page 794

C17-4: Static Rate Control - Ports above 2.5 Gbps Page 796

C17-5: CA Ports Must Validate P_Keys on Packets Page 796

C17-6: P_Key Table Size per Port Page 796

C17-7: Setting P_Key Table - No OS Involvement Page 796

C17-8: Each Port Must Support at least One GID Page 796

C17-9: All QPs Shall Source and Sink Local Packets Page 801

C17-10: Except QP0, All QPs Shall Handle GRH Packets Page 801

C17-11: UD, RC and UC Transport Required on All QPs Page 801

C17-12: Transport Services - Support Rules Page 801

C17-13: Solicited Event Rule Page 801

C17-14: MTU Support - Valid Sets Page 801

C17-15: Receive Queues - E-to-E Flow Control Credit Page 801

C17-16: Send Queues - E-to-E Flow Control Credit Page 801

o17-2: UDMcast: Generation Page 801

o17-3: UDMcast: Receiving Page 802

o17-4: APM: Respond to, Generate Auto Path Migrate Page 802

C17-18: Loopback Allowed, but Can't Go on the Wire Page 802

C17-19: Backpressure Rule to avoid Deadlock Page 802

C17-20: Backpressure Inbound/Outbound - Deadlock Page 802

C17-21: Inbound Pkts - Link/Network/Transport Check Page 802

C17-22: EUI-64 GUID In Non-Volatile Memory Page 802

C17-23: Priority of Attached SM - Non-Volatile Memory Page 802

C17-24: QP0 and QP1 Support Req'd for Every Port Page 803
20.4 TCA Compliance Category



In order to claim compliance to the InfiniBand Volume 1 specification to
the Compliance Category of TCA, a product shall meet all requirements
specified in this section, except for those statements preceded by
Qualifiers that the product does not support. In addition, a compliant TCA
shall meet all Section 20.13 Common Port Requirements on page 878 and all
Section 20.14 Common MAD Requirements on page 879.
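
The qualifier rule in the paragraph above can be illustrated with a minimal sketch. It is not part of the specification: the `Statement` record, the `applicable` and `is_compliant` helpers, and the single-qualifier model are assumptions made for illustration (some listed statements carry compound qualifiers such as "RC or UC", which this sketch does not model), and the Common Port and Common MAD requirements of Sections 20.13 and 20.14 are not covered. The three sample entries are taken from the TCA list that follows.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Statement:
    """One compliance statement from the summary list (hypothetical model)."""
    ident: str                       # statement label, e.g. "C4-2" or "o5-1"
    title: str                       # short title as printed in the list
    qualifier: Optional[str] = None  # e.g. "RD" or "RDMA"; None if unqualified


def applicable(statements: Iterable[Statement], supported: Set[str]) -> List[Statement]:
    """Keep unqualified statements and those whose qualifier the product supports."""
    return [s for s in statements if s.qualifier is None or s.qualifier in supported]


def is_compliant(statements: Iterable[Statement], supported: Set[str], met_ids: Set[str]) -> bool:
    """True when every applicable statement is among the statements the product meets."""
    return all(s.ident in met_ids for s in applicable(statements, supported))


# Sample entries copied from the TCA list; the qualifier is the prefix before the title.
TCA_SAMPLE = [
    Statement("C4-2", "EUI-64 Assignment - At Least One per Port"),
    Statement("o5-1", "Reliable Datagram ETH Format", qualifier="RD"),
    Statement("o5-2", "RDMA ETH Format", qualifier="RDMA"),
]
```

Under these assumptions, a product that supports RDMA but not RD would need to meet C4-2 and o5-2, while o5-1 would drop out of the check.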



C4-2: EUI-64 Assignment - At Least One per Port Page 110

C4-3: GID Usage and Properties Page 110

C4-4: Addressing Rules Page 114

C4-5: LID (Local Identifier) Usage and Properties Page 114

o5-1: RD: Reliable Datagram ETH Format Page 123

o5-2: RDMA: RDMA ETH Format Page 124

o5-3: Atomics: Atomic Extended Transport Hdr Format Page 125

o5-4: Atomics: Atomic ACK ETH Format Page 126

o5-5: RawD: Raw Packet Header Rules Page 128

o5-6: RawD: EtherType Usage in RWH Page 128

o5-7: RawD: Raw Packet Length Rule Page 128

o5-8: RawD: Raw Packet Header Format Page 128

07-7: Packet Discard Required if Link Checks Fail Page 139

07-21: VL15 Buffer(s) required For each Port Page 148

o7-5: SL-to-VL Mapping Table Size Page 152

07-14: RawDMcast: Raw Multicast Operational Rules Page 183

07-66: Link Layer DLID Check - Use Base LID Only Page 185

08-1: Rules for Including a GRH in Packets Page 192

08-1: Optional Use of GRH in Packets Page 192

09-2: Transport - Opcode, Header, and Payload Table Page 200

o9-1: Solicited Event Bit may Invoke CQ Event Handler Page 202

09-4: Solicited Event Bit - Excluded from Hdr Validation Page 202

09-5: BTH TVer Field Value Page 202

09-6: BTH - Reserve 8 Field Value Page 203

09-7: BTH - Reserve 7 Field Value Page 203

o9-2: RD: RDETH - Reserve Field Value Page 204

09-8: DETH - Reserve Field Value Page 204

o9-3: RDMA: RETH - DMA Length Field Value Limits Page 206

09-10: SEND Operation Size Limits Page 209

09-5: Segmentation and Reassembly of RC, UC, RD Page 209

09-12: SEND Operation - UD Allows only Single Packets Page 209

09-13: Multi-Packet Messages - Do Not Interleave Page 211

09-14: SEND Request - Required IBA Headers Page 211

09-15: SEND Response - Required IBA Headers Page 211

o9-6: RDMA: RDMA WRITE - DMA Length Limits Page 212

o9-7: RDMA: RDMA WRITE Segmenting/Reassembly Page 212

o9-8: RDMA: Multi-packet RDMA WRITE Rule Page 214

o9-9: RDMA: RDMA WRITE Request - Req'd Headers Page 214

o9-10: RDMA: RDMA WRITE Resp. - Req'd Headers Page 215

09-11: RDMA: RDMA READ Segments/Reassembly Page 215

09-12: RDMA: RDMA READ DMA Length Limits Page 215

09-13: RDMA: RDMA READ Request - Req'd Headers Page 218

09-14: RDMA: RDMA READ Response - Reqd Headers Page 218

09-15: Atomics: ATOMIC Op Request - Req'd Headers Page 221

09-16: Atomics: ATOMIC Op Response - Req'd Headers Page 221

09-17: Atomics: ATOMIC Op - QP Atomicity Rule Page 221

09-18: Atomics: ATOMIC Op - Enhanced Atomicity Rule Page 221

09-25: Transmission of Requests - Ordering Rule Page 227

09-26: Transmission of Message - Data Payload Order Page 227

09-19: RC: Acknowledge Packets - Strong Ordering Page 227

09-19: RD: Acknowledge Packets - Strong Ordering Page 227
C9-28: Responder - Order of Request Execution Page 227

C9-29: Receipt of Requests - Order of Completion Page 227 

C9-30: Requester- Order of WQE Completion Page 228 

09-31: Requester - WQE Fence Attribute Behavior Page 228

09-32: WQE Order of Completion vs. Execution Page 228 

O9-20: RDMA: Responder RDMA WRITE Buffer Rule Page 228 

09-21: RDMA: Responder RDMA READ Buffer Rule Page 228 

09-33: Receive Queue - Buffer Content Validity Page 228 

09-34: Transport Layer - Packet Header Validation Page 228 

09-35: Transport Layer - IBA Packts - Header Validation Page 228 

09-37: BTH Validation - Dest QP and QP State Page 231 

09-22: RD: BTH Validation - Dest QP and State vs. EEC .... Page 231 

09-23: UDMcast: Well-known Destination QP Value Page 231 

09-38: BTH Validation - Request Checked vs. QP Page 231 

09-39: BTH Validation - Silent Drop Rule Page 232 

09-40: BTH Validation - Behavior - RD Packet Page 232 

09-41: Transport Layer - BTH P_Key - QP0 Rule Page 232

09-42: Transport Layer - BTH P_Key - QP1 Rule Page 232 

09-43: Transport Layer - Required P_Key Validation Page 233 

o9-24: RD: Transport Layer - Required P_Key Validation Page 233 

09-44: GRH - NxtHdr Field - Validation Page 233 

09-45: GRH - IPVers Field - Validation Page 233 

09-46: GRH - Dest QP UD, SGID/DGID non-Validation Page 234 

09-47: GRH - SGID/DGID Field - Validation Page 234 

09-25: RD: RDETH - EE Context Field Value - Validation Page 235 

09-26: RD: RDETH - EE Context - Validation Behavior Page 235 

09-48: DETH - Q_Key Field Value - Ignored for QP0 Page 236

09-49: DETH - Q_Key Field Value - QP1 Rule Page 236 

09-50: DETH - Q_Key Field Value - Validation Page 236 

09-51: Transport Layer - ACK depends on Valid Keys Page 236

09-52: Transport - LRH - SLID/DLID Field Validation Page 237 

09-53: Transport - LRH - DLID Field Validation Page 237 

09-54: Transport - LRH - SLID Field Validation Page 237 

09-55: Transport - LRH Validation - Permissive LID Rule .... Page 237 

09-56: Transport Layer - SLID Invalid if Multicast Page 237 

09-27: RC or UC: Transport Layer - LID Validation Page 237 

09-28: RD: Transport Layer -LID Validation Page 237 

09-58: Packet Validation - IBA Unreliable Multicast Page 238 

09-59: Packet Validation - IBA Unreliable Multicast - QP Page 238 

09-60: Requesters - WQE Completion Responsibility Page 239 

09-61: Send Queue PSNs - Allowed Outstanding Qty Page 243

o9-29: RC: PSN Insertion for Reliable Svc Pkts Page 244 

09-29: RD: PSN Insertion for Reliable Svc Pkts Page 244 

o9-30: RC: Responder - Behavior - RC Service Page 247 

o9-30: RD: Responder - Behavior - RD Service Page 247 

09-31: RD: BTH - PSN Field Value for RD Service Page 248 

09-31: RC: BTH - PSN Field Value for Reliable Svc Page 248 

09-32: RC or RD: BTH - Initial PSN for Reliable Service Page 248 

09-33: RD: Requester - PSN Value - RD Service Page 249 

09-33: RC: Requester - PSN Value - Reliable Svc Page 249 

09-34: RD: Validation of EEC RDD Against Send Queue Page 250 

o9-35: RD: EEC vs QP - RDD Mismatch Behavior Page 250 

09-67: Requester - BTH OpCode Field Value Rules Page 250 

09-68: Requester - BTH OpCode Field Value Table Page 250 

09-69: Requester - Packet PayLen - First/Middle Page 250 

09-70: Requester - Packet PayLen - Only Page 251 

09-71: Requester - Packet PayLen - Last Page 251

09-36: RDMA: Requester - RETH DMALen Field - Limits Page 251 

o9-37: RD: Responder - Validation of Inbound RD Req Page 251 

09-37: RC: Responder - Valid. of Inbound Requests Page 251

o9-39: RD: Responder - Valid. of Inbound RD Req. PSN Page 251

o9-39: RC: Responder - Inbound Request PSN Chk Page 251 
o9-40: RD: Responder - ePSN Calculation Rule Page 253

o9-40: RC: Responder - ePSN Calculation Rule Page 253

09-41: RC: Responder - ePSN Update - Rec. Queue Page 253

o9-42: RD: Responder - ePSN Update - Rec. Queue Page 254

o9-43: RC: Responder - New Request - Exec/Response Page 254

o9-43: RD: Responder - New Req. - Exec/Response Page 254

o9-44: RC: Responder - Valid Duplicate Req Behavior Page 254

o9-44: RD: Responder - Valid Duplicate Req Behavior Page 254

09-45: RC: Resp. - Inbound PSN Outside Valid Region Page 255

o9-45: RD: Resp. - Inbound PSN Outside Valid Region Page 255

o9-46: RC: Resp. - Behavior after NAK Sequence Error Page 256

o9-46: RD: Resp. - Behavior after NAK Sequence Error Page 256

o9-47: RC: Responder - Validation of OpCode Seq Page 257

o9-48: RD: Responder - Validation of OpCode Seq Page 257

o9-49: RD: Responder - New Request Rule Page 257

09-82: Responder - BTH OpCode Field - Validation Page 258

o9-50: RC: Resp. - Request of Unsupported Fcn Page 258

o9-50: RD: Resp. - Request of Unsupported Fcn Page 258

o9-51: RC: Resp. - Reserved OpCode Error - Behavior Page 258

o9-51: RD: Resp. - Reserved OpCode Error - Behavior Page 258

09-52: RC: Resp. - Incorrect Pad Count Error - Behavior Page 259

o9-52: RD: Resp. - Incorrect Pad Count Error - Behavior Page 259

o9-53: RC: Resp. - Insufficient Res. Error - Behavior Page 259

o9-53: RD: Resp. - Insufficient Res. Error - Behavior Page 259

09-54: RC: Resp. - NAK Response - Completion Rule Page 259

o9-54: RD: Resp. - NAK Response - Completion Rule Page 259

09-55: RC and RDMA: Resp - R_Key Unchecked Page 260

09-55: RD and RDMA: Resp - R_Key Unchecked Page 260

09-89: R_Key Violation Behavior Page 260

09-90: R_Key Violation Behavior - Completion Rule Page 260

09-91: LRH - PktLen Validation - WQE buffer Page 260

09-92: LRH PktLen Validation - OpCode Check v. MTU Page 261

09-93: LRH PktLen Validation - Invalid Request Resp Page 261

09-56: RDMA: DMA Length Field Validation - Behavior Page 261

09-95: PSN Field Value - SEND/RDMA WRITE Resp Page 262

09-57: RDMA: PSN Field Value - RDMA READ Resp Page 263

09-58: Atomics: PSN Field Value - ATOMIC Op Resp Page 263

09-59: RDMA: AETH MSN Field - RDMA READ Resp Page 264

o9-60: RDMA: AETH Header - RDMA READ Resp Page 265

09-61: RDMA: BTH OpCode Field - RDMA READ Resp Page 265

09-62: RDMA: RDMA READ Response - Error Behavior Page 265

09-63: RDMA: Request Process - Order - RDMA READ Page 266

09-102: Response is Required Page 267

09-103: Update of ePSN - Error Behavior Page 267

09-104: Response to ATOMIC or RDMA READ Request Page 267

09-105: Duplicate SEND Behavior Page 268

09-64: RDMA: Duplicate RDMA WRITE Behavior Page 269

09-106: Duplicate SEND/RDMA WRITE - Error Behavior Page 269

09-65: RDMA: RDMA READ Responses - Duplicates Page 271

09-66: Atomics: Duplicate ATOMIC Op Req. Behavior Page 272

o9-67: Atomics: Duplicate ATOMIC Op Req. Error Page 273

09-68: Atomics: Duplicate ATOMIC Req. -Local Error Page 273

09-111: NAK PSN Field Value - Except for RDMA READ Page 273

09-112: NAK PSN Field Value - RDMA READ Page 273

09-113: RNR NAK - PSN Field Value Page 273

09-114: Wait for first valid ePSN after Sequence Error Page 273

09-115: Response to Duplicate Requests - except NAK Page 273

09-116: BTH AckReq Field - Behavior Page 274

09-69: RDMA: PSN Field Value - RDMA READ Resp Page 275

o9-70: RDMA: AETH Requirement Page 276

09-71: RD: AETH Syndrome - Defined Values Page 277

09-71: RC: AETH Syndrome - Defined Values Page 277
09-72: RD: AETH Syndrome Value - MSN Invalid Page 277

C9-120: Request - Malformed ACK Message Rule Page 278 

C9-121: Responder - PSN Field Value - Sequence Error Page 279

C9-122: NAK Sequence Error - Subsequent Behavior Page 279 

C9-123: PSN Field Value - Duplicate Request - Behavior Page 279 

09-73: RDMA: BTH Field Value - NAK Remote Access Page 280 

o9-73: Atomics: BTH Field Value - NAK Remote Access Page 280 

C9-125: BTH Field Value - NAK Invalid Request Page 280 

C9-126: BTH Field Value - NAK Remote Operational Err Page 280

09-74: RD: EEC Field Value - P_Key mismatch Page 281 

C9-127: Dest QP Field Value - NAK Invalid RD Request Page 281 

09-75: RC: Requester - PSN Uniqueness - RNR NAK Page 282 

09-76: RD: AETH Field Value - RNR NAK Timer Page 282 

09-76: RC: AETH Field Value - RNR NAK Timer Page 282 

09-77: RD: RNR NAK Retry - Counting and Behavior Page 282 

09-77: RC: RNR NAK Retry - Counting and Behavior Page 282 

09-133: Packet Header Validation - Transport Page 284 

09-78: RC: ACK PSN Field Value - Order Detection Page 284 

09-78: RD: ACK PSN Field Value - Order Detection Page 284 

09-79: RC: ACK Syndrome Field Value - Error Behavior Page 285 

09-79: RD: ACK Syndrome Field Value - Error Behavior Page 285 

O9-80: RC: Ghost ACKs - Req'd Behavior Page 285 

O9-80: RD: Ghost ACKs - Req'd Behavior Page 285 

o9-81: RC: Repeated NAK Seq. Errors - Behavior Page 286

09-82: APM and RC: Repeated NAK Seq. Errors Page 286 

o9-83: RC: Requester - Duplicate ACK Behavior Page 287 

09-83: RD: Requester - Duplicate ACK Behavior Page 287 

09-84: RC: ACK/NAK Timer - Outstanding SEND Req Page 288 

09-85: RD: ACK/NAK Timer -Outstanding SEND Req Page 288 

o9-87: RC: Timeout Rules for Outstanding Requests Page 290 

o9-88: RD: Timeout Rules for Outstanding Requests Page 290 

09-89: RC: End-to-End Flow Control Credit - Dupl. ACKs Page 290 

o9-90: RC: Duplicate ACKs - Behavior Page 291 

09-91: RC: Reliable Connection and Reliable Service Page 291 

09-92: RC: AETH MSN Field Value - RC Service Page 292 

09-93: RC: Responder - MSN Calculation Page 293 

09-94: RC: AETH MSN Field Value Page 296 

o9-95: Receive Queue - End-to-End Flow Control Credit .... Page 297 

09-151: End-to-End Flow Control - Send Queue Behavior .... Page 297 

09-152: AETH - MSN Field Value - Unsolicited ACK Page 297 

09-153: End-to-End Flow Control Rules Page 298 

09-154: End-to-End Flow Control - Syndrome for Disable Page 298

09-155: End-to-End Credit - Usage Page 298 

09-156: End-to-End Flow Control - Lack of Initial Credit Page 298 

o9-96: RC: End-to-End Flow Control Credit Calc/Update Page 301 

09-97: RC: End-to-End Flow Control Credit - AETH Page 301 

09-159: Requester - Send Queue Behavior - Credit Limit Page 301 

09-160: Requester Behavior - Transaction Ordering Rules .... Page 302 

09-161: End-to-End Flow Control - Encoded Count Page 302 

09-162: Requester Behavior - Send Queue - WQE Limit Page 304

09-98: RC: SEND Request - Limited WQE Case - 1 Pkt Page 305

09-99: RDMA WRITE - Request Xmt - AckReq bit Page 305 

09-164: Requester - Ability to Receive Unsolicited ACK Page 305 

O9-100: RD: QP Availability, Capabilities Page 307 

O9-101: RD: EEC Support and Capabilities Page 308 

O9-102: RD: RD Message Completion - Single Msg EEC Page 308 

O9-103: RD: RD Message Completion - Single Msg QP Page 308 

O9-104: RD: OpCode, ETH, Transport Validation, etc Page 308 

O9-105: RD: Error Detection and Handling Page 308 

O9-106: RD: Communication Management Support Page 308 

O9-107: RD: EE Context - Ability to Avoid Shutdown Page 308 

09-111: RD: NAK-RNR Behavior for Over-run Condition Page 314 
09-112: RD: Out of Order Receive Queue Completion Page 315

09-113: RD: Send Queue - WQE Completion Order Page 315

09-114: RD: Upper Layers - Tolerate of Out of Order Pkts Page 315 

C9-165: Transport - Packet Header Validation Page 316 

09-115: UC: PSN Examination for Packet Validation Page 316 

09-116: UC: Opcode Examination Page 317 

C9-168: BTH OpCode Validation - Support for Request Page 317 

C9-169: Inbound Request - Resources to Receive Page 317 

09-117: UC and RDMA: R_Key Validation - Behavior Page 317 

09-171: Inbound Request Packet - Validation - UD Page 317 

09-118: UC: Inbound Request Packet - Validation Page 317 

09-119: UC: BTH PSN Field Value - Current PSN Page 319

O9-120: UC: BTH PSN Field Value - First Request Packet .... Page 320 

09-121: UC: PSN Update/Modify - Transport Control Page 320 

09-122: UC: BTH PSN Field Value - Calculation Page 320 

09-123: UC: Packet OpCode - First/Middle/Last/Only Page 321 

09-124: UC: Packet Payload Length - Opcode Page 322 

09-125: UC: Message Completion Rule - SEND/WRITE Page 322 

09-126: UC: Expected PSN Value Page 322 

09-127: UC: Expected PSN Update/Modify Page 323 

09-128: UC: Inbound Request Pkt - Ordering - Detection Page 323 

09-129: UC: Inbound Request Packet - New ePSN Page 323 

O9-130: UC: BTH PSN Inbound Pkt - Compare to ePSN Page 324 

09-131: Notification to Client, One or More Lost Messages Page 324

09-132: UC: Message Drop/Restart Rule Page 324 

09-133: UC: Inbound Request - OpCode Check Page 325

09-134: UC: Invalid OpCode Behavior Page 325 

09-135: UC: Invalid OpCode Behavior - New Message Page 326 

C9-190: Unreliable Connection - Valid Function Check Page 326 

09-191: Invalid UC Request - Behavior Page 326

09-136: UC and RDMA: RETH R_Key Validation Page 327 

09-137: UC and RDMA: RETH R_Key - zero-len WRITE Page 327 

09-138: UC: LRH PktLen Check - Sufficient Recv Buffer Page 327 

09-139: UC: LRH PayLen Check- OpCode First/Middle Page 328 

O9-140: UC: LRH PayLen Check- OpCode Only Page 328 

09-141: UC: LRH PayLen Check- OpCode Last Page 328 

09-142: UC and RDMA: Total Payload Data Check Page 328 

09-143: UC: Pad Count Check - OpCode First/Middle Page 328 

09-200: Message Size Limit - Unreliable Datagram Page 330 

09-201: Basic Services - Unreliable Datagram Reqmts Page 330

09-202: Unreliable Datagram Error Handling Page 330 

09-203: PSN Generation and Message Completion - UD Page 332 

09-204: PSN Calculation - Unreliable Datagram Page 332 

09-205: Responder - OpCode/Length/Completion - UD Page 332 

09-144: Responder - PSN Treatment - UD Page 332 

09-206: BTH OpCode Field Value - Validation - UD Page 333 

09-207: Inbound SEND Request - Queue Entry - UD Page 333 

09-208: Packet Headers - Raw vs. IPv6 NxtHdr Page 334 

09-145: RawD: Packet Payload and LRH PktLend Pad Page 335 

09-146: RawD: Association of QPs with a Raw Service Page 335 

09-147: RawD: QPs Supporting Raw Service Page 335 

09-148: RawD: Maximum Raw Datagram Pkt Payload Page 335 

09-209: Requester - Locally Det. Xmt Error - UD Page 338 

09-149: RC or UC: Requester - Locally Det. Xmt Error Page 338 

O9-150: RD: Requester, Transmit - Locally Detected Error .... Page 338 

09-151: RD: Requester - Excessive Retry Detection Page 338

09-151: RC: Requester - Excessive Retry Detection Page 338

09-152: APM: Migration Attempt Allowed following Errors Page 339 

09-211: Requester - Error Behavior and Fault Class Table Page 340

09-153: RC: Requester - Class A Error Behavior Page 344 

09-153: RD: Requester - Class A Error Behavior Page 344 

09-154: RC: Requester - Class A Errors - Client Rule Page 344 
09-154: RD: Requester - Class A Errors - Client Rule Page 344

09-214: Requester - Class B Error - Behavior Page 345 

09-215: Requester - Class B Error - Discard ACKs Page 346

09-216: Requester - Class C Error Behavior Page 346 

09-155: RD: Requester - Class C Error Behavior Page 346 

09-156: RD: Requester - Class D Error - Behavior Page 346 

09-157: RC: Requester - Class E Error - Behavior Page 348 

09-157: RD: Requester - Class E Error - Behavior Page 348 

09-218: Requester - Class F Error - Behavior Page 348 

09-219: Responder - Error Behavior and Fault Class Table . . . Page 349 

09-220: Responder - Class A Error - Behavior Page 352 

09-158: RC: Responder - Class B Error - Behavior Page 353 

09-158: RD: Responder - Class B Error - Behavior Page 353 

09-159: RC: Responder - Class C Error - Behavior Page 354 

09-223: Responder - Class D Error - Behavior Page 354 

O9-180: UC: Responder - Class D1 Error - Behavior Page 356 

09-161: RD: Responder - Class E Error - Behavior Page 357

09-162: RD: Responder - Class F Error - Behavior Page 357 

09-225: Responder - Class G Error - Behavior Page 358

010-119: Received packet discarded if P_Key mismatch Page 429

010-121: P_Key Table size reqs Page 429 

010-123: P_Key Table initialization wrt non-volatile storage Page 430 

010-124: P_Key checking for incoming packets Page 430 

O10-55: P_Key traps: general requirements Page 430 

O10-56: P_Key traps: new violation before trap sent Page 431 

O10-57: P_Key counters: general requirements Page 431 

010-130: Partitioning reqs same as for CI; exception Page 432 

010-131: No P_Key checking on packets sent to SMI Page 433 

010-132: Special P_Key checking for packets sent to GSI Page 433 

010-133: P_Key for packets sent from GSI Page 433 

C12-1: CM protocol support req'd with RC, UC, and RD Page 520

012-12: Conditions when SIDR_REQ msg support req'd Page 521

C13-28: ClassPortInfo Required for each Mgt Class Page 589

C13-29: ClassPortInfo Required For Each GS Class Page 589

C13-30: SA ClassPortInfo Required Page 589

o13-1: Notice: Notice Data Layout Page 592

o13-2: Notice: InformInfo Data Layout Page 593

C13-32: No Traps Without TrapDLID Target Page 594

o13-3: Trap: Maximum Rate of Generation Page 595

o13-4: Trap: Use of Notice Attribute Page 595

o13-5: Trap: Transaction ID setting Page 595

o13-6: Trap: Response to TrapRepress Page 595

o13-7: TrapRepress Dropped if No Matching Trap Page 595

o13-8: Notice: Notice Queue is FIFO Page 596

o13-9: Notice: Action When Too Many Notices Requested Page 596

o13-10: Notice: Meaning of NoticeCount Page 596

o13-11: Notice: Response to Set(Notice) Page 596

C13-33: SM MADs (SMPs) appear on Port 0 Page 601

C13-37: SMP Processing Above/Below the Verb Layer Page 601

C14-14: Subnet Management Agent Required Methods Page 623

C14-15: M_Key not Checked When PortInfo:M_Key = 0 Page 624

C14-16: M_Key checks when PortInfo:M_Key is not zero Page 624

C14-17: Lease Period Timer Countdown Page 624

C14-18: PortInfo:M_KeyViolations Counting Page 625

C14-19: Lease Period Counting Halts on valid M_Key Page 625

C14-20: M_KeyProtectBits When Lease Period Expires Page 625

C14-21: M_KeyLeasePeriod 0 = Lease Never Expires Page 625

C14-22: M_Key, ProtectBits, & LeasePeriod Set Together Page 626

C14-23: Init of M_Key, ProtectBits & LeasePeriod Page 626

C14-24: SMA Required Attributes Page 626

C14-25: PortInfo Set when M_Key is 0 Page 651

C14-26: PortInfo Set when M_Key is not 0 Page 651
C14-27: Requests to Change RO Registers Are Ignored Page 651

C14-28: SubnGetResp Generation when M_Key is 0 Page 651

C14-29: SubnGetResp Generation when M_Key non0 Page 652

C14-30: SubnGetResp Content Page 652

C14-31: SubnGetResponse TransactionID Page 652

C14-32: SubnGetResp MADHeader:M_Key Page 652

o14-1: Trap: SubnTrap M_Key field Page 652

o14-2: Trap: Trap Generation Interval Page 652

o14-3: Trap: Only Sent When Portstate is Active Page 652

o14-4: Trap, Notice: Monitoring PortState Page 653

o14-5: Trap: Trap 128 on Port State Change Page 653

o14-6: Notice: Logged on Port State Change Page 653

C14-33: P_Key and Q_Key Mismatches Monitored Page 653

C14-34: P_Key or Q_Key Violation Count Reporting Page 653

o14-7: Trap: P_Key, Q_Key Violation =Trap 257, 258 Page 653

o14-8: Notice: Must Log P_Key & Q_Key Violations Page 653

o14-9: Trap: trap 256 On M_Key Mismatch Page 654

o14-10: Notice: M_Key mismatch is logged Page 654

o14-11: Trap: trap 129, 130, or 131 When Link Problems Page 654

o14-12: Notice: Must Log Link Problems Page 654

C16-1: PM Agent is mandatory on all nodes Page 717

C16-2: PM MADs Follow Common MAD Format & Usage Page 718

C16-3: PortSamplesControl, PortSamplesResult Req'd Page 721

C16-4: Each sampler must have >=1 & <=15 counters Page 721

C16-5: PMA Mandatory quantities: all ports, all nodes Page 726

o16-1: Optional Performance Counters: Attributes Page 726

C16-6: PortCounters Attribute is Mandatory Page 732

C16-7: Counters Init at 0 and Saturate at All 1s Page 732

o16-2: Optional Performance Counters: Use Page 736

C16-9: BMA Mandatory on all nodes Page 749

C16-10: BM datagrams follow common MAD format and use Page 751

o16-3: Trap: BM Trap DataDetails Layout Page 756

C16-11: BMA checks B_Key Page 759

C16-12: BMA Action when B_Key check Fails Page 760

C16-13: B_Key, B_Key Protection, B_Key lease at reset Page 760

C17-2: Multiport CAs Shall Support Multiple Subnets Page 792

C17-3: Association of QPs with Ports Page 794

C17-4: Static Rate Control - Ports above 2.5 Gbps Page 796

C17-5: CA Ports Must Validate P_Keys on Packets Page 796

C17-6: P_Key Table Size per Port Page 796

C17-8: Each Port Must Support at least One GID Page 796

C17-9: All QPs Shall Source and Sink Local Packets Page 801

C17-10: Except QP0, All QPs Shall Handle GRH Packets Page 801

C17-14: MTU Support - Valid Sets Page 801

o17-1: Receive Queues - E-to-E Flow Control Credit Page 801

C17-17: Send Queue - E-to-E Flow Control Credit Page 801

o17-2: UDMcast: Generation Page 801

o17-3: UDMcast: Receiving Page 802

o17-4: APM: Respond to, Generate Auto Path Migrate Page 802

C17-19: Backpressure Rule to avoid Deadlock Page 802

C17-20: Backpressure Inbound/Outbound - Deadlock Page 802

C17-21: Inbound Pkts - Link/Network/Transport Check Page 802

C17-22: EUI-64 GUID In Non-Volatile Memory Page 802

C17-23: Priority of Attached SM - Non-Volatile Memory Page 802

C17-24: QP0 and QP1 Support Req'd for Every Port Page 803
20.5 Switch Compliance Category

In order to claim compliance to the InfiniBand Volume 1 specification to
the Compliance Category of Switch, a product shall meet all requirements
specified in this section, except for those statements preceded by
Qualifiers that the product does not support. In addition, a compliant Switch
shall meet all Section 20.13 Common Port Requirements on page 878
and all Section 20.14 Common MAD Requirements on page 879.

• C4-1: EUI-64 Assignment Page 109

• C4-3: EUI-64 Assignment - At Least One per Port Page 110

• C4-5: LID (Local Identifier) Usage and Properties Page 114

• C7-5: How to Corrupt a Packet Page 139

• o7-1: Truncation is Allowed when Corrupting a Packet Page 139

• C7-6: Packet Truncation Rule Page 139

• C7-9: Packet Check Rule for Management Packets Page 141

• C7-22: VL15 Buffer(s) required For Each Switch Page 148

• C7-24: Inbound VL15 Packets Stay in VL15 Going Out Page 149

• C7-33: SL on Packets Must be Invariant in a Subnet Page 151

• o7-6: VLs: SL-to-VL Mapping Rules Page 152

• o7-7: VLs: SL-to-VL Mapping Table Size Page 152

• o8-1: Optional Use of GRH in Packets Page 192

• C10-118: P_Key value not modified in forwarded packet Page 429

• C10-120: SMA port contains P_Key table Page 429

• C10-134: General partitioning requirements for GSI QP Page 433

• C13-28: ClassPortInfo Required for each Mgt Class Page 589

• C13-29: ClassPortInfo Required For Each GS Class Page 589

• C13-30: SA ClassPortInfo Required Page 589

• o13-1: Notice: Notice Data Layout Page 592

• o13-2: Notice: InformInfo Data Layout Page 593

• C13-32: No Traps Without TrapDLID Target Page 594

• o13-3: Trap: Maximum Rate of Generation Page 595

• o13-4: Trap: Use of Notice Attribute Page 595

• o13-5: Trap: Transaction ID setting Page 595

• o13-6: Trap: Response to TrapRepress Page 595

• o13-7: TrapRepress Dropped If No Matching Trap Page 595

• o13-8: Notice: Notice Queue is FIFO Page 596

• o13-9: Notice: Action When Too Many Notices Requested Page 596

• o13-10: Notice: Meaning of NoticeCount Page 596

• o13-11: Notice: Response to Set(Notice) Page 596

• C13-33: SM MADs (SMPs) appear on Port 0 Page 601

• C13-37: SMP Processing Above/Below the Verb Layer Page 601

• C14-14: Subnet Management Agent Required Methods Page 623

• C14-15: M_Key not Checked When PortInfo:M_Key = 0 Page 624

• C14-16: M_Key checks when PortInfo:M_Key is not zero Page 624

• C14-17: Lease Period Timer Countdown Page 624

• C14-18: PortInfo:M_KeyViolations Counting Page 625

• C14-19: Lease Period Counting Halts on valid M_Key Page 625

• C14-20: M_KeyProtectBits When Lease Period Expires Page 625

• C14-21: M_KeyLeasePeriod 0 = Lease Never Expires Page 625

• C14-22: M_Key, ProtectBits, & LeasePeriod Set Together Page 626

• C14-23: Init of M_Key, ProtectBits & LeasePeriod Page 626

• C14-24: SMA Required Attributes Page 626

• C14-25: PortInfo Set when M_Key is 0 Page 651

• C14-26: PortInfo Set when M_Key is not 0 Page 651

• C14-27: Requests to Change RO Registers Are Ignored Page 651

• C14-28: SubnGetResp Generation when M_Key is 0 Page 651

• C14-29: SubnGetResp Generation when M_Key non0 Page 652

• C14-30: SubnGetResp Content Page 652

• C14-31: SubnGetResponse TransactionID Page 652
• C14-32: SubnGetResp MADHeader:M_Key Page 652

• o14-1: Trap: SubnTrap M_Key field Page 652

• o14-2: Trap: Trap Generation Interval Page 652

• o14-3: Trap: Only Sent When Portstate is Active Page 652

• o14-4: Trap, Notice: Monitoring PortState Page 653

• o14-5: Trap: Trap 128 on Port State Change Page 653

• o14-6: Notice: Logged on Port State Change Page 653

• C14-33: P_Key and Q_Key Mismatches Monitored Page 653

• C14-34: P_Key or Q_Key Violation Count Reporting Page 653

• o14-7: Trap: P_Key, Q_Key Violation = Trap 257, 258 Page 653

• o14-8: Notice: Must Log P_Key & Q_Key Violations Page 653

• o14-9: Trap: trap 256 On M_Key Mismatch Page 654

• o14-10: Notice: M_Key mismatch is logged Page 654

• o14-11: Trap: trap 129, 130, or 131 When Link Problems Page 654

• o14-12: Notice: Must Log Link Problems Page 654

• C16-1: PM Agent is mandatory on all nodes Page 717

• C16-2: PM MADs Follow Common MAD Format & Usage Page 718

• C16-3: PortSamplesControl, PortSamplesResult Req'd Page 721

• C16-4: Each sampler must have >=1 & <=15 counters Page 721

• C16-5: PMA Mandatory quantities: all ports, all nodes Page 726

• o16-1: Optional Performance Counters: Attributes Page 726

• C16-6: PortCounters Attribute is Mandatory Page 732

• C16-7: Counters Init at 0 and Saturate at All 1s Page 732

• o16-2: Optional Performance Counters: Use Page 736

• C16-9: BMA Mandatory on all nodes Page 749

• C16-10: BM datagrams follow common MAD format and use Page 751

• o16-3: Trap: BM Trap DataDetails Layout Page 756

• C16-11: BMA checks B_Key Page 759

• C16-12: BMA Action when B_Key check Fails Page 760

• C16-13: B_Key, B_Key Protection, B_Key lease at reset Page 760

• C18-1: Forwarding Table - Linear or Random, Not Both Page 814

• C18-2: Unicast Forwarding Table - Size Limits Page 814

• o18-1: UDMcast: Packet Replication by Switch Page 814

• o18-2: UDMcast: Multicast Forwarding Table - Size Page 814

• C18-3: VL15 is Required Page 814

• o18-3: VL15 Buffer Resource May Be Shared Page 814

• o18-4: VLs: SLtoVL Mapping Function Page 815

• o18-5: SL to VL Mapping - Single Data VL Page 815

• o18-6: P_Key SRE_In: Inbound P_Key Enforcement Page 815

• o18-7: P_Key SRE: Outbound P_Key Enforcement Page 815

• C18-4: Size Requirement for Forwarding VL15 Packets Page 815

• C18-5: Legal MTU Configurations - Across All Ports Page 815

• C18-6: Size Requirement for Forwarding Data Packets Page 815

• C18-7: Switch Initialization Rules Page 816

• C18-8: Physical Layer Compliance (Excludes Port 0) Page 817

• C18-9: Link Layer Compliance (Excludes Port 0) Page 817

• C18-10: Packet Relay - Port 0 Rule Page 817

• C18-11: Port 0 Behavior - Transport Requirements Page 817

• o18-8: Port 0 Behavior - Differences from Other Ports Page 817

• C18-12: Port 0 - Required LMC Value Page 817

• C18-13: Port 0 - LMC Value Change Disallowed Page 817

• C18-14: Port 0 - Get Response - LMC Value Page 817

• C18-15: Receiver Queueing - Packet VL Field Rule Page 818

• C18-16: Receiver Queueing - Inbound Raw Packet Filter Page 818

• C18-17: Link Layer Flow Control - No Excuse for Discard Page 818

• o18-9: P_Key SRE_In: Inbound Enforcement Disabled Page 818

• o18-10: P_Key SRE_In: Inbound P_Key List is Per Port Page 818

• o18-11: P_Key SRE_In: P_Key Port's List Same for In/Out Page 818

• o18-12: P_Key SRE_In: Inbound P_Key Table Size Per Port Page 818

• o18-13: P_Key SRE_In: Inbound P_Key List - Programmable Page 818

• o18-14: P_Key SRE_In: Inbound Enforcement - IBA Packets Page 818

• o18-15: P_Key SRE_In: Inbound Enforcement - Raw Pkts Page 819
• o18-16: P_Key SRE_In: Inbound Enforcmt - IBA Pkt Discard Page 819

• C18-18: Data Packet Relay Rule Page 819

• C18-19: VL15 Packet Relay Rule Page 819

• C18-20: Packet Relay - Same Port Rules Page 819

• C18-21: Modification of Data Within Packet Disallowed Page 820

• C18-22: Forwarding Rule - Permissive Address in DLID Page 820

• o18-17: Port 0/SMI/GSI - Packet Discard Rule Page 820

• C18-23: Packet Transmission - From Port 0 to Other Ports Page 820

• o18-18: VLs: SL to VL Mapping - Change VL Field in LRH Page 820

• o18-19: SL to VL Mapping - Single Data VL Page 820

• o18-20: VLs: SL to VL Mapping - Outbound VL Rule Page 820

• o18-21: VLs: SL to VL Mapping - Discard Rule Page 820

• o18-22: Single Data VL - Packet Relay - VL Field Rule Page 821

• C18-24: Single Data VL - Packet Relay - Outbound VL Page 821

• C18-25: non-VL15 Packets - Receive Queue Page 821

• C18-26: Packet Relay - Port Behavior if a VL Stalls Page 821

• o18-23: VL15 Packets - Packet Relay - Discard Rule Page 821

• C18-27: Packet Transmission - In Order Delivery Rules Page 821

• o18-24: Forwarding to Get Around Blocked VLs Page 821

• C18-28: Forwarding Table - Legal Configurations Page 821

• C18-29: Multicast Relay - Required Behavior Page 821

• o18-25: UDMcast: Multicast Relay - Packet Replication Page 822

• C18-30: MulticastFDBCap Rule for Non-UDMcast Switch Page 822

• C18-31: Linear Forwarding - Table Properties Page 822

• C18-32: Linear Forwarding - Advertised Table Size Page 822

• C18-33: Linear Forwarding - Advertised Random Capability Page 822

• C18-34: Linear Forwarding Table - Programming Page 822

• C18-35: Linear Forwarding Table - LinearFDBTop Value Page 822

• C18-36: Linear Forwarding - Unicast Discard Rules Page 822

• C18-37: Random Forwarding Table - Properties Page 823

• C18-38: Random Forwarding Table - DefaultPort Page 823

• C18-39: Random Forward Table - Forward to DefaultPort Page 823

• C18-40: Random Forward Table - Discard Rule Page 823

• C18-41: Random Forward Table - Added Discard Rule Page 823

• C18-42: Random Forward Table - Advertised Size Page 823

• C18-43: Random Forward Table - LinearFDBCap Value Page 824

• o18-26: Random Forward Table - Per Port Limit Option Page 824

• C18-44: Random Fwd Table - Advertised Per Port Limit Page 824

• C18-45: Random Fwd Table - LIDsPerPort Value Page 824

• C18-46: Random Forwarding Table - LID/LMC Support Page 824

• C18-47: Primary and Non-Primary Port Value Rules Page 824

• C18-48: Primary/Non-Primary Port Value - Programming Page 824

• C18-49: Required Multicast - Fwd to Primary Mcast Port Page 824

• C18-50: Required Mcast - Fwd to Non-Primary Mcast Port Page 825

• C18-51: Required Multicast - Discard Rule Page 825

• o18-27: UDMcast: Packet Replication Page 825

• o18-28: UDMcast: Multicast Forwarding Table and LIDs Page 825

• o18-29: UDMcast: Multicast Forwarding Table Size Page 825

• o18-30: UDMcast: Advertising Multicast Fwd Table Size Page 825

• o18-31: UDMcast: Multicast Packet Replication/Relay Page 825

• o18-32: UDMcast: Outbound VL Field Value Page 826

• o18-33: P_Key SRE_Out: Outbound Enforcement Disabled Page 826

• o18-34: P_Key SRE_Out: Outbound P_Key List is Per Port Page 826

• o18-35: P_Key SRE_Out: P_Key Port's List Same for In/Out Page 826

• o18-36: P_Key SRE_Out: Outbound P_Key Table Size Page 826

• o18-37: P_Key SRE_Out: Outbound P_Key List - Programmable Page 826

• o18-38: P_Key SRE_Out: Outbound Enforcement - IBA Pkts Page 826

• o18-39: P_Key SRE_Out: Outbound Enforcement - Raw Pkts Page 827

• o18-40: P_Key SRE_Out: Outbound Enf. - IBA Pkt Discard Page 827

• C18-52: Transmission - Outbound Raw Packet Filter Page 827

• C18-53: Transmission - Valid VCRC Required Page 827

• C18-54: Transmission - Valid EGP Character Page 827

• C18-55: Transmission - VL Arbitration Required Page 827

• C18-56: Physical Layer Compliance - Excludes Port 0 Page 828

• C18-57: Link Layer Compliance - Excludes Port 0 Page 828

• C18-58: Transmit Queueing - Discard Rules Page 828

• C18-59: Transmit Queueing - Switch Lifetime Limit Page 828

• o18-41: Transmit Queueing - Discard Rule in Fast Switch Page 829

• C18-60: Packet Transmit - Truncation and Marking Bad Page 829

• C18-61: Subnet Management Interface is Required Page 829

• C18-62: General Services Interface is Required Page 829

• C18-63: General Service Interface - GSI P_Key Reqmts Page 829
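
The rule stated at the head of this category (meet every listed statement except those preceded by a Qualifier the product does not support) can be illustrated with a short sketch. This is only an illustration under stated assumptions: the `Statement` type, the `applicable` function, and the hand-assigned split of each title into a qualifier and a remainder are hypothetical and are not defined by the specification. The three sample entries are taken from the Switch list above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Statement:
    """One checklist entry (hypothetical type, for illustration only)."""
    label: str                # compliance statement label, e.g. "C18-18"
    title: str                # summary title from the checklist
    qualifier: Optional[str]  # e.g. "VLs" or "Trap"; None when unqualified


def applicable(statements: Iterable[Statement],
               supported: Set[str]) -> List[Statement]:
    """Keep unqualified statements plus those whose qualifier is supported."""
    return [s for s in statements
            if s.qualifier is None or s.qualifier in supported]


# Three entries from the Switch list; the qualifier split is hand-assigned.
SWITCH_SAMPLE = [
    Statement("C18-18", "Data Packet Relay Rule", None),
    Statement("o7-6", "SL-to-VL Mapping Rules", "VLs"),
    Statement("o13-3", "Maximum Rate of Generation", "Trap"),
]

# A hypothetical switch that supports the VLs qualifier but not Trap.
required = applicable(SWITCH_SAMPLE, {"VLs"})
print([s.label for s in required])  # ['C18-18', 'o7-6']
```

Keeping the qualifier as an explicit field, rather than parsing it out of the title, avoids guessing where a qualifier ends in titles that contain a colon for other reasons.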
20.6 Router Compliance Category

In order to claim compliance to the InfiniBand Volume 1 specification to
the Compliance Category of Router, a product shall meet all requirements
specified in this section, except for those statements preceded by
Qualifiers that the product does not support. In addition, a compliant Router
shall meet all Section 20.13 Common Port Requirements on page 878
and all Section 20.14 Common MAD Requirements on page 879.

C4-1: EUI-64 Assignment Page 109

C4-2: EUI-64 Assignment - At Least One per Port Page 110

C4-3: EUI-64 Assignment - At Least One per Port Page 110

C4-4: Addressing Rules Page 114

C4-5: LID (Local Identifier) Usage and Properties Page 114

o5-5: RawD: Raw Packet Header Rules Page 128

o5-6: RawD: EtherType Usage in RWH Page 128

o5-7: RawD: Raw Packet Length Rule Page 128

o5-8: RawD: Raw Packet Header Format Page 128

C7-5: How to Corrupt a Packet Page 139

o7-1: Truncation is Allowed when Corrupting a Packet Page 139

C7-6: Packet Truncation Rule Page 139

C7-9: Packet Check Rule for Management Packets Page 141

C7-21: VL15 Buffer(s) required For each Port Page 148

C7-26: Do Not Forward VL15 Packets Page 149

o7-5: SL-to-VL Mapping Table Size Page 152

o7-14: RawDMcast: Raw Multicast Operational Rules Page 183

o8-1: Optional Use of GRH in Packets Page 192

C8-12: GRH Modification - IPVer Rule Page 194

C8-13: GRH Modification - TClass Rule Page 194

o8-2: GRH Modification - FlowLabel Rule Page 194

C8-14: GRH Modification - PayLen Rule Page 194

C8-15: GRH Modification - NxtHdr Rule Page 194

C8-16: GRH Modification - HopLmt Rule Page 194

C8-17: GRH Modification - SGID Rule Page 194

C8-18: GRH Modification - DGID Rule Page 195

C10-118: P_Key value not modified in routed packet Page 429

C13-28: ClassPortInfo Required for each Mgt Class Page 589

C13-29: ClassPortInfo Required For Each GS Class Page 589

C13-30: SA ClassPortInfo Required Page 589

o13-1: Notice: Notice Data Layout Page 592

o13-2: Notice: InformInfo Data Layout Page 593

C13-32: No Traps Without TrapDLID Target Page 594

o13-3: Trap: Maximum Rate of Generation Page 595

o13-4: Trap: Use of Notice Attribute Page 595

o13-5: Trap: Transaction ID setting Page 595

o13-6: Trap: Response to TrapRepress Page 595

o13-7: TrapRepress Dropped if No Matching Trap Page 595

o13-8: Notice: Notice Queue is FIFO Page 596

o13-9: Notice: Action When Too Many Notices Requested Page 596

o13-10: Notice: Meaning of NoticeCount Page 596

o13-11: Notice: Response to Set(Notice) Page 596

C13-33: SM MADs (SMPs) appear on Port 0 Page 601

C13-35: SMPs Don't Exit a Subnet Page 601

C13-37: SMP Processing Above/Below the Verb Layer Page 601

C14-14: Subnet Management Agent Required Methods Page 623

C14-15: M_Key not Checked When PortInfo:M_Key = 0 Page 624

C14-16: M_Key checks when PortInfo:M_Key is not zero Page 624

C14-17: Lease Period Timer Countdown Page 624

C14-18: PortInfo:M_KeyViolations Counting Page 625

C14-19: Lease Period Counting Halts on valid M_Key Page 625
C14-20: M_KeyProtectBits When Lease Period Expires Page 625

C14-21: M_KeyLeasePeriod 0 = Lease Never Expires Page 625

C14-22: M_Key, ProtectBits, & LeasePeriod Set Together Page 626

C14-23: Init of M_Key, ProtectBits & LeasePeriod Page 626

C14-24: SMA Required Attributes Page 626

C14-25: PortInfo Set when M_Key is 0 Page 651

C14-26: PortInfo Set when M_Key is not 0 Page 651

C14-27: Requests to Change RO Registers Are Ignored Page 651

C14-28: SubnGetResp Generation when M_Key is 0 Page 651

C14-29: SubnGetResp Generation when M_Key non0 Page 652

C14-30: SubnGetResp Content Page 652

C14-31: SubnGetResponse TransactionID Page 652

C14-32: SubnGetResp MADHeader:M_Key Page 652

o14-1: Trap: SubnTrap M_Key field Page 652

o14-2: Trap: Trap Generation Interval Page 652

o14-3: Trap: Only Sent When Portstate is Active Page 652

o14-4: Trap, Notice: Monitoring PortState Page 653

o14-5: Trap: Trap 128 on Port State Change Page 653

o14-6: Notice: Logged on Port State Change Page 653

C14-33: P_Key and Q_Key Mismatches Monitored Page 653

C14-34: P_Key or Q_Key Violation Count Reporting Page 653

o14-7: Trap: P_Key, Q_Key Violation = Trap 257, 258 Page 653

o14-8: Notice: Must Log P_Key & Q_Key Violations Page 653

o14-9: Trap: trap 256 On M_Key Mismatch Page 654

o14-10: Notice: M_Key mismatch is logged Page 654

o14-11: Trap: trap 129, 130, or 131 When Link Problems Page 654

o14-12: Notice: Must Log Link Problems Page 654

C16-1: PM Agent is mandatory on all nodes Page 717

C16-2: PM MADs Follow Common MAD Format & Usage Page 718

C16-3: PortSamplesControl, PortSamplesResult Req'd Page 721

C16-4: Each sampler must have >=1 & <=15 counters Page 721

C16-5: PMA Mandatory quantities: all ports, all nodes Page 726

o16-1: Optional Performance Counters: Attributes Page 726

C16-6: PortCounters Attribute is Mandatory Page 732

C16-7: Counters Init at 0 and Saturate at All 1s Page 732

o16-2: Optional Performance Counters: Use Page 736

C16-9: BMA Mandatory on all nodes Page 749

C16-10: BM datagrams follow common MAD format and use Page 751

o16-3: Trap: BM Trap DataDetails Layout Page 756

C16-11: BMA checks B_Key Page 759

C16-12: BMA Action when B_Key check Fails Page 760

C16-13: B_Key, B_Key Protection, B_Key lease at reset Page 760

C19-1: Router Unicast Routing Table - Minimum Size Page 832

C19-2: VL15 Required Page 832

C19-3: Virtual Lanes - Allowed Configurations Page 832

C19-4: Each Port Shall have Independent VL15 Page 832

C19-5: VL15 Packets Can't Be Routed Between Ports Page 832

C19-6: SL to VL Mapping Req'd for > One Data VL Page 832

o19-1: SL to VL Mapping Optional for a Single Data VL Page 832

C19-7: Preserve TClass when Routing Page 832

o19-2: P_Key SRE: Inbound Enforcement Page 833

o19-3: P_Key SRE: Outbound Enforcement Page 833

C19-8: Allowed MTU Configurations Page 833

C19-9: Packet Size - MTU + 128 Bytes Page 833

C19-10: Power-up Initialization Page 833

C19-11: Per-Port Physical Layer Requirements Page 835

C19-12: Per-Port Link Layer Requirements Page 835

C19-13: PortInfo Attribute for Each Router Port Page 835

C19-14: Each Port Shall Have at Least One GID Page 835

C19-15: Use Packet VL to Determine Inbound VL Queue Page 836

C19-16: Inbound Raw Packet Filtering Page 836

o19-4: P_Key SRE: Inbound Enforcement Rule Page 836
o19-5: P_Key SRE: Inbound P_Key List is Per-Port Page 836

o19-6: P_Key SRE: Same List at Port for In/Out Page 836

o19-7: P_Key SRE: Maximum Size of List per Port Page 836

o19-8: P_Key SRE: Inbound List is Programmable Page 836

o19-9: P_Key SRE: Inbound Pkt Discard except VL15 Page 836

o19-10: P_Key SRE: Inbound Vers. Check except VL15 Page 837

C19-17: Packet Relay Based on DGID Page 837

C19-18: Packet Relay Disallowed for VL15 Packets Page 837

o19-11: VLs: Rules for Outbound VL Field Value Page 837

C19-19: TClass Mapping to SL - Best Effort Page 838

C19-20: SL to VL Mapping and Discard Rules Page 838

C19-21: Outbound Data - Wait for Credit Available Page 838

C19-22: Packet Relay Rules - Credit Availability Page 838

C19-23: Packet Ordering Rules Page 838

o19-12: Packet Relay - Credit Availability Work-Around Page 838

C19-24: Backpressure - Must not Cause Deadlock Page 838

C19-25: Discard Rule Based on Hop Count Page 838

C19-26: Hop Count Decremented for Each Relay Page 838

o19-13: P_Key SRE: Outbound Packet Checking Page 839

o19-14: P_Key SRE: Outbound List is Per Port Page 839

o19-15: P_Key SRE: Inbound/Outbound P_Key Lists Page 839

o19-16: P_Key SRE: Outbound Per Port List Size Limit Page 839

o19-17: P_Key SRE: Outbound List Per Port Page 839

o19-18: P_Key SRE: Discard/Truncate Pkt Rule - IBA Page 839

o19-19: P_Key SRE: Discard/Truncate Rule - Raw Pkts Page 839

C19-27: Outbound Transmission of Raw Packets Page 840

C19-28: SLtoVL Mapping Function Page 840

C19-29: VCRC Required Page 840

C19-30: EGP Symbol Page 840

C19-31: Physical Layer Compliance Page 840

C19-32: Link Layer Compliance Page 840

C19-33: Packet Size vs. MTU - Truncation Rule Page 841

C19-34: GRH is Required in Received Packets Page 841

C19-35: Packet Lifetime and Head of Queue Lifetime Page 841

C19-36: Packet Transmission Error - Corrupting Packets Page 842

o19-20: Transmission Errors - Truncation Option Page 842

C19-37: Transmission Error - Packet Corruption Page 842

C19-38: SMI Requirement Page 842

C19-39: GSI Requirement Page 842
20.7 Subnet Manager Compliance Category

In order to claim compliance to the InfiniBand Volume 1 specification to 
the Compliance Category of Subnet Manager, a product shall meet all re- 
quirements specified in this section, except for those statements pre- 
ceded by Qualifiers that the product does not support. 



• C10-115: Default P_Key value is 0xFFFF Page 428

• C10-117: Use of default and invalid P_Key values Page 429

• C13-1: Subnet Must Have at Least One SM Page 572

• C13-31: Trap and/or Notice Must Be Supported Page 594

• o13-12: Trap: Event Subscription Confirmation Page 597

• o13-13: Trap: Event Subscription Denial Page 597

• o13-14: Trap: Set(InformInfo) Verification Page 598

• o13-15: Trap: Set(InformInfo) Verification Failure Page 598

• o13-16: Trap: Set(InformInfo) Range Verification Page 598

• o13-17: Trap: Event Subscription Action Page 599

• C13-33: SM MADs (SMPs) are sent from Port 0 Page 601

• C13-34: GSA MADs Directed to QP1 Page 601

• C13-45: Validation of SMPs Page 605

• C14-8: Directed Route SMPs Processed by the SMI Page 619

• C14-12: Directed Route SMPs Handled by SMI Page 621

• C14-13: Returning Directed Route SMP Handling Page 621

• C14-14: Subnet Management Required Methods Page 623

• C14-15: M_Key not Checked When PortInfo:M_Key = 0 Page 624

• C14-16: M_Key checks when PortInfo:M_Key is not zero Page 624

• C14-19: Lease Period Counter Halts on valid M_Key Page 625

• C14-20: ProtectBits Setting When Lease Period Expires Page 625

• C14-21: LeasePeriod of 0 means Lease Never Expires Page 625

• C14-22: One Set() Sets M_Key, ProtectBits, & LeasePeriod Page 626

• C14-23: Initialization of M_Key, ProtectBits & LeasePeriod Page 626

• C14-24: SM Required Attributes Page 626

• C14-35: SM always associated with one port, one subnet Page 655

• C14-36: SM Shall Comply with Initialization State Machine Page 655

• C14-37: Priority, GUID, SM_Key Configurable Out Of Band Page 655

• C14-38: SM response to SubnGet/Set(SMInfo) Page 656

• C14-39: SM Enters DISCOVERING at Startup Page 657

• C14-40: DISCOVERING state Use Of SubnGet(*) Page 657

• C14-41: Go To Standby When Find Higher Priority SM Page 657

• C14-42: Become Master When No Higher Priority SM Page 658

• C14-43: Master SM Must Provide Paths to Itself Page 658

• C14-44: Punt Init When M_Key Prohibits Access Page 658

• C14-45: SM in Standby Does Not Configure Subnet Page 658

• C14-46: Standby SMs Must poll the Master SM Page 658

• C14-47: Standby SM enters Discovery if Master Fails Page 658

• C14-48: Standby to Discovery on Discover Control Packet Page 658

• C14-49: Standby to Not-Active on DISABLE Page 659

• C14-50: Standby Action on HANDOVER Control Packet Page 659

• C14-51: SMState Set NOT-ACTIVE in NOT-ACTIVE State Page 659

• C14-52: SM Does not Send Set/Get in NOT-ACTIVE State Page 659

• C14-53: SM Response to Set/Get in NOT-ACTIVE State Page 659

• C14-54: Not-Active Response to STANDBY Control Packet Page 659

• C14-55: Only the Master SM shall configure subnet nodes Page 660

• C14-56: Master SM Shall Sweep Subnet Page 660

• C14-57: Minimum SM Sweep Rate Page 660

• C14-58: Master SM ActCount Incrementing Page 660

• C14-59: Master SM Action on Topology Change Page 660

• C14-60: Discovery Halts on Finding a Lower-Priority Master Page 660

• C14-61: Master Action on Finding Higher-Priority SM Page 661

• C14-62: Master SM shall Initialize Many Things Page 662

• C14-63: SM Action on Seeing Portstate at Initialize Page 665

• C14-64: Items Master SM Must Check in Sweep Page 666

• C14-65: SM Shall Not Check M_Key in a SubnGetResp(*) Page 667

• C14-66: Reply to SubnSet/Get(SMInfo) When :M_Key = 0 Page 667

• C14-67: Reply to SubnSet/Get(SMInfo) When M_Key non0 Page 667

• C14-68: Master SM Doesn't Check M_Key in SubnTrap() Page 667

• C14-69: Out-of-Band Disable Mechanism Page 667

• C14-70: SM Behavior When Disabled Page 667

• C14-71: SM Behavior When Not Disabled Page 667

• C14-72: Disabled Can Change At Any Time Page 668



20.8 Subnet Administration Compliance Category 



In order to claim compliance to the InfiniBand Volume 1 specification to 
the Compliance Category of Subnet Administration, a product shall meet 
all requirements specified in this section, except for those statements pre- 
ceded by Qualifiers that the product does not support. 



• C13-31: Trap and/or Notice Must Be Supported Page 594

• o13-12: Trap: Event Subscription Confirmation Page 597

• o13-13: Trap: Event Subscription Denial Page 597

• o13-14: Trap: Set(InformInfo) Verification Page 598

• o13-15: Trap: Set(InformInfo) Verification Failure Page 598

• o13-16: Trap: Set(InformInfo) Range Verification Page 598

• o13-17: Trap: Event Subscription Action Page 599

• C13-34: GSA MADs Directed to QP1 Page 601

• C15-1: SA Must Exist Page 670

• C15-2: Information Provided by SA Page 670

• C15-3: SA Must Live and Die with its SM Page 671

• C15-4: SA MADs Follow GSI & MAD Rules Page 672

• C15-5: SA Required/Optional Records and Methods Page 673

• C15-7: SA Methods Page 674

• o15-1: SAOPT: Optional Methods Page 674

• C15-8: SA Required Attributes Page 678

• o15-2: SAOPT: Optional Attributes Page 679

• o15-3: UDMcast: MCGroupRecord Modify Prohibited Page 693

• o15-4: UDMcast: MCGroupRecord Mod Means SM Action Page 694

• o15-5: UDMcast: MCMemberRecord Modify Prohibited Page 695

• o15-6: UDMcast: MCMemberRecord Mod Means SM Action Page 695

• C15-10: Multi-packet Transaction Protocol Must Be Used Page 696

• C15-11: Access Restrictions for Path Records Page 706

• C15-12: Access Restrictions for Other Attributes Page 707

• C15-13: SA Location Found from SM Page 708

• C15-14: ClassPortInfo for SA Tells Where SA Is Located Page 708

• C15-15: SA_KEY Is A Versioning Key Page 708

• C15-16: SA Event forwarding Conforms to General Model Page 708

• C15-17: Component Mask Bit Assignments Page 708

• C15-18: Component Mask Bit=1 Means Match Page 708

• C15-19: Component Mask Bit=0 Means No Change on Edit Page 709

• C15-20: AM 0xffffffff = query by template Page 709

• C15-21: AM not 0xffffffff = query by RID range Page 711

• C15-22: Query by RID Range Start and End Points Page 711

• C15-23: Table Query SA_KEY=0 Means Whole Table Page 712

• C15-24: Table Query SA_KEY not 0 Means Changes Only Page 712

• C15-25: Action When Table Query SA_KEY Never Current Page 712

• o15-7: SA May Refuse A Request, With Status Page 712

• o15-8: SAOPT: SubnAdmGetBulk returns all records Page 712

• o15-9: SAOPT: SubnAdmGetBulkResp Has SAResponse Page 712

• o15-10: SAOPT: SubnAdmGetBulkResp() When No Data Page 713

• o15-11: SAOPT: SubnAdmGetBulk SA_Key Use Page 713

• o15-12: SAOPT: State records only modified by Master Page 714
• o15-13: SAOPT: Filling a RID Range Page 714

• o15-14: SAOPT: SubnAdmConfigResp Admin Data & Status Page 715

• C15-26: SubnAdmGet() Returns an RA Page 715

• C15-27: SubnAdmGet Query by Template Page 715

• C15-28: SubnAdmGet SA_Key Ignored Page 715

• o15-15: SubnAdmGet of >1 RA is Invalid Page 715

• C15-29: SubnAdmSet() Can Add an RA Page 715

• C15-30: SubnAdmSet() Cannot Modify RAs Page 716

• C15-31: SubnAdmSet() Can Delete RAs Page 716

• o15-16: Invalid SubnAdmSet() Deletion is Null Page 716
20.9 Communication Manager Compliance Category

In order to claim compliance to the InfiniBand Volume 1 specification to
the Compliance Category of Communication Manager, a product shall
meet all requirements specified in this section, except for those
statements preceded by Qualifiers that the product does not support.

Though a number of optional CM-specific features exist, no CM-unique
qualifiers have been defined since the optional CM-specific features are
fully described within the CM Chapter, and most of the chapter's optional
compliance statements would require unique qualifiers.

C12-2:
012-6:
012-7:
012-8:
012-9:
o12-10:
012-11:
012-13:
012-14:
C16-14:
C16-15:

Required CM adherence to CM protocol Page 520 

CM message content requirements Page 520 20 

REJ message support required Page 520 

Req'd behavior upon receipt of MRA message Page 520 

Reqs if supports sending REQ message Page 520 22 

Reqs if supports sending MRA message Page 520 

Reqs if supports sending REP message Page 520 

Reqs if supports sending RTU message Page 520 24 

Reqs if supports sending DREQ message Page 520 

Reqs if supports sending DREP message Page 520 " 

Req'd snd/rcv msgs if initiates connect requests Page 520 26 

Req'd snd/rcv msgs if accepts connect requests Page 520 

DREP handling is required if sends DREQ msg Page 521 

Reqs if supports sending SIDR_REQ message Page 521 28 

Reqs if supports sending SIDR_REP message Page 521 

APM: Conditions when receiving LAP required Page 521 

APM: Conditions when sending LAP required Page 521 30 

Communication Management Agent Mandatory Page 786 

MADS Conform to Common MAD Format, Use Page 786 ^ ' 


20.10 Performance Manager Compliance Category

In order to claim compliance to the InfiniBand Volume 1 specification to
the Compliance Category of Performance Manager, a product shall meet
all requirements specified in this section, except for those statements
preceded by Qualifiers that the product does not support.

• C13-31: Trap and/or Notice Must Be Supported Page 594

• o13-12: Trap: Event Subscription Confirmation Page 597

• o13-13: Trap: Event Subscription Denial Page 597

• o13-14: Trap: Set(InformInfo) Verification Page 598

• o13-15: Trap: Set(InformInfo) Verification Failure Page 598

• o13-16: Trap: Set(InformInfo) Range Verification Page 598

• o13-17: Trap: Event Subscription Action Page 599

• C13-34: GSA MADs Directed to QP1 Page 601

• C16-2: PM MADs Follow Common MAD Format & Usage Page 718

• C16-8: Sample Taking Procedure Page 735



20.11 Vendor-Defined Manager Compliance Category 



In order to claim compliance to the InfiniBand Volume 1 specification to
the Compliance Category of Vendor-Defined Manager, a product shall 
meet all requirements specified in this section, except for those state- 
ments preceded by Qualifiers that the product does not support. 

• C13-31: Trap and/or Notice Must Be Supported Page 594

• o13-12: Trap: Event Subscription Confirmation Page 597

• o13-13: Trap: Event Subscription Denial Page 597

• o13-14: Trap: Set(InformInfo) Verification Page 598

• o13-15: Trap: Set(InformInfo) Verification Failure Page 598

• o13-16: Trap: Set(InformInfo) Range Verification Page 598

• o13-17: Trap: Event Subscription Action Page 599

• C13-34: GSA MADs Directed to QP1 Page 601 



20.12 Optional Management Agent Compliance Category 



In order to claim compliance to the InfiniBand Volume 1 specification to 
the Compliance Category of Optional Management Agent, a product shall 
meet all requirements specified in this section, except for those state- 
ments preceded by Qualifiers that the product does not support. 



DMA: DM Datagrams Follow Common MAD Use Page 763 

DMA and Trap: DM Trap DataDetails Layout Page 767 

SNMP: Datagrams Follow Common MAD Usage Page 773 

SNMP: Multipacket SNMP MADs Filled Page 777 

SNMP: Multipacket SNMP Timeout Action Page 778 

SNMP: Multipacket SNMP Transaction ID Page 778 

SNMP: SNMP Information Passthrough Page 778 

SNMP: SNMP Tunnelling Must Exist Page 780 

VMA: Vendor MADs Follow Common MAD Usage Page 781

VMA: Vendor Classes Have ClassPortInfo Attrib Page 782

AMA: MADs Conform to Common MAD Use, Format Page 783

AMA: Application Classes Support ClassPortInfo Page 785
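
The qualifier rule stated for each Compliance Category (a product must meet every listed statement except those preceded by Qualifiers it does not support) can be illustrated with a small filter. The Python sketch below is illustrative only and is not part of the specification: the `Statement` type and the `parse` and `applicable` helpers are hypothetical names, the "QUALIFIER: title Page N" layout is assumed from the listings in this chapter, and treating a compound prefix such as "DMA and Trap" as requiring both qualifiers is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    qualifiers: frozenset  # qualifier names preceding the statement (empty if none)
    title: str
    page: int


def parse(line: str) -> Statement:
    """Split an entry of the assumed form '[QUALIFIER: ]title Page N'."""
    body, _, page = line.rpartition(" Page ")
    head, sep, title = body.partition(": ")
    if not sep:
        # No qualifier prefix: the statement applies unconditionally.
        return Statement(frozenset(), body.strip(), int(page))
    # Assumption: 'A and B' means the statement is qualified by both A and B.
    quals = frozenset(q.strip() for q in head.split(" and "))
    return Statement(quals, title.strip(), int(page))


def applicable(statements, supported):
    """Keep statements whose qualifiers are all supported by the product."""
    return [s for s in statements if s.qualifiers <= frozenset(supported)]


lines = [
    "DMA: DM Datagrams Follow Common MAD Use Page 763",
    "DMA and Trap: DM Trap DataDetails Layout Page 767",
    "SNMP: SNMP Tunnelling Must Exist Page 780",
    "Communication Management Agent Mandatory Page 786",
]

# A hypothetical product that supports the DMA qualifier only.
required = applicable([parse(l) for l in lines], {"DMA"})
for s in required:
    print(s.title, "(page", str(s.page) + ")")
```

Under these assumptions, a product supporting only DMA would still have to meet the unqualified statement and the DMA statement, while the SNMP and "DMA and Trap" statements would fall outside its compliance claim.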
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016-15: 
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20.13 Common Port Requirements 



Multiple Compliance Categories share common Port Requirements. To 
avoid unnecessary duplication, Port Requirements are collected here and 
referenced by the appropriate Compliance Categories. 



• C5-1: Packet Structure and Packet Header Location Page 118

• C5-2: LRH Packet Header Format Page 121

• C5-3: GRH Packet Format Page 121

• C5-4: BTH Packet Format Page 122

• C5-5: Datagram Extended Transport Header Format Page 124

• C5-6: ACK Extended Transport Header Format Page 126

• C5-7: Payload Size Limited to MTU bytes Page 127

• C5-8: Last or Only Packet of a Message Page 127

• C5-9: Pad Field Usage in IBA Transport Packets Page 127

• C7-1: Port Link State Machine and Terms Page 135

• C7-2: Allowed Management Command Port States Page 137

• C7-3: Allowed Mgmt Command Port State Transitions Page 137

• C7-4: Packet Receiver State Machine Page 138

• C7-8: Data Packet Check State Machine Page 141

• C7-10: Packet Check State Machine - Discard Failures Page 141

• C7-11: Data Packet Checks - Corrupt/Discard Rules Page 141

• C7-12: Link Packet Check State Machine Page 144

• C7-13: Virtual Lanes - Rules for Protocol-Aware IBA Ports Page 146

• o7-2: VLs: Flow Control - IBA Port Rules Page 146

• C7-14: Virtual Lane Field Use - IBA Protocol-Aware Port Page 147

• C7-15: Virtual Lane Numbering and Legal Configurations Page 147

• C7-16: Virtual Lanes - VL0 and VL15 are mandatory Page 147

• o7-3: VLs: Rules for VL Configurations Page 147

• C7-17: Data VLs Start from 0 and Go Up Sequentially Page 148

• C7-18: VL15 is not Subject to Flow Control Page 148

• C7-19: Discard VL15 if Receive Buffer is Full Page 148

• C7-20: Protocol-Aware Ports - VL15 Packet Support Page 148

• C7-23: VL15 Packets Preempt Other Outbound Packets Page 149

• C7-25: VL15 Packets - SL Usage Rules Page 149

• C7-26: No GRH Allowed on VL15 Packets Page 149

• C7-27: VL15 Packets - Payload Maximum Page 149

• C7-28: Per Port Buffering Resources Page 149

• C7-29: Use Flow Control Packets to Advertise Credit Page 149

• C7-30: Link Packet Send and Receive Page 150

• C7-31: Minimum Buffer Resources per VL Page 150

• C7-32: VL on Incoming Packet Specifies Receive Buffer Page 150

• o7-4: VLs: SL-to-VL Mapping and VL Arbitration Rules Page 151

• C7-34: SL Sourcing Rule for Single-VL Ports Page 151

• o7-8: VLs: SL-to-VL Mapping and Behavior Page 153

• o7-9: VLs: SLtoVLMappingTable and Port Behavior Page 154

• C7-35: VL Arbitration - Packet Ordering Requirements Page 154

• o7-10: VL Arbitration Rules - Single Data VL Page 155

• o7-11: VLs: VL Arbitration Rules - Multiple Data VLs Page 155

• o7-12: VLs: More VL Arbitr. Rules - Multiple Data VLs Page 155

• C7-36: LRH Header Format Page 159

• C7-37: LRH VL Field Value Page 159

• C7-38: LRH LVer Field Value Page 159

• C7-39: LRH Reserve Field Value Page 159

• C7-40: LRH Link Next Header (LNH) Field Value Page 160

• C7-41: LRH Reserve Field Value Page 160

• C7-42: LRH Packet Length Field Value Page 160

• C7-43: LRH Minimum Packet Length - IBA Transport Page 161

• C7-44: LRH Min. Packet Length - non-IBA Transport Page 161

• C7-45: LRH Maximum Value for Packet Length Field Page 161

• C7-46: LRH SLID Field Value Page 161
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ICRC Field - Required for IBA Transport Packets Page 161 

ICRC Field Value Page 161 

VCRC Field - Required for All Data Packets Page 163 

VCRC Field Value Page 163 

LPCRC Field - Required in All Link Packets Page 164 

LPCRC Field Value Page 164 

Flow Control Packet Rules Page 176 

Flow Control Packet Format Page 176

Flow Control Init Operand Page 177 

Flow Control - Normal Operand Page 177

Flow Control Packets - Reserved Op Field Values Page 177

Flow Control - FCTBS - Initialization Page 177 

Flow Control - FCTBS - Initialize PortState Page 177 

Flow Control - ABR counter - Initialization Page 177 

Flow Control - ABR - Update Mechanism Page 177

Flow Control - ABR - Update/Discard Rule Page 177

Flow Control - FCCL - Calculation Method Page 178 

Data Packet Transmission - Credit Rules Page 179 

Flow Control - VL15 Packets are Not Subject Page 179 

UDMcast: Multicast Operational Rules Page 180 

Link Layer - Classification of Errors on Receive Page 186 

Link Layer - Precedence of Error Counters Page 186 

Link Layer - Link Integrity and Overrun Errors Page 186 

Link Layer - Detection of Flow Control Errors Page 187 

Link Layer - Retraining Rules Page 187 

GRH Rule for IPVer Value Page 193 

GRH Rule for TClass Value Page 193

GRH Rule for Unused FlowLabel Value Page 193 

GRH Rule for FlowLabel Value If Used Page 193

GRH Rule for PayLen Value Page 193 

GRH Rule for NxtHdr Value - IBA Transport Page 193 

GRH Rule for NxtHdr Value - Raw Transport Page 193 

GRH Rule for HopLmt Value Page 193

GRH Rule for SGID Value Page 194 

GRH Rule for DGID Value Page 194 

GRH Verification/Discard Rules Page 195 

BTH - Required for Packets Using IBA Transport Page 199 

Transport Layer - BTH TVer Field - Validation Page 231 

Static Rate Control - Required Support Criterion Page 366 

Static Rate Control - Programmed Injection Rate Page 366

Static Rate Control - Interpacket Delay Page 367 

Static Rate Control - Unsupp. Value(s) - Behavior Page 367
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MAD Conventions and Data Placement Page 573 

MAD Length Page 574 

MAD Base Format Page 574 

MADHeaderMgmtHeader Values Page 576 

MAD Method Names and Method Values Page 578 

Assigned Method Values Otherwise Unused Page 578 

Class Specific Methods Request/Response Page 578 

Responders Shall Not Coalesce Responses Page 579 

GetResp() Required for Every Valid Get() Page 580

GetResp() Required for Every Valid Set() Page 580

GetResp() Attribute Same as Request Page 580

The default RespTimeValue shall be 8 Page 584 

Appropriate RespTimeValue (General) Page 584 

RespTimeValue for other than Report/ReptResp Page 584 

RespTimeValue for Request/Response Seq's Page 586 

Timer Resets in multi MAD Sequences Page 586 

Choosing a Value for MADHeaderTransactionID Page 586 

TID, SGID, MgmtClass to Associate Messages Page 586 

TransactionID for Request Sequences Page 586 

TransactionID for Responses Page 586 

TransactionID for Response Sequences Page 586 

TransactionID for Message Sequences Page 586 

Common Status Field Bit Values for Responses Page 587 

Common Status Field for Request & Message is 0 Page 587

Unused Attribute Modifier is Set to 0 Page 588 

Common Attribute Format for Get, Set, GetResp Page 588 

GMPs Done Below Verbs Invisible to Verbs Page 602 

GMPs Above Verbs Appear on QP1 Page 602 

GMPs Always Validated As If on QP1 Page 603 

GMPs Dispatched to Agents Page 603 

GMP Redirection Page 603 

GMP Redirection-Required Status Page 603 

GRH in Redirection Only If In ClassPortInfo Page 604

SMP Validation Page 605 

SMP GetResponse() Status Values Page 605

Processing of Directed Route SMPs Page 606 

GMP Validation Page 606 

GMP GetResponse() Status Values Page 606

No QP1 When No GSM Above Verbs Page 607 

Validation of Redirected GMPs Page 607 

Validation of all MADs Page 608 

Action When GMPs Fail Validation Page 608 

GetResponse() Status on Failed Validation Page 608

Response Packet Construction, no GRH Page 608 

Response Packet Construction With GRH Page 609 

SMP Required MgmtClass Values Page 611 

SMPs Conform to General MAD Format, Use Page 611 

LID Routed SMP Required Format Page 611 

Directed Routed SMP Required Format Page 613 

Only an SM Shall Originate a Directed route SMP Page 618

Directed Route SMP Field Initialization Page 618 

Directed Route SMP UD Field Initialization Page 619 

Outgoing Directed Route SMP Handling Page 619 

Directed Route SMP Field Initialization Page 621 

Directed Route Response Header Initialization Page 621 
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